[jira] [Resolved] (HDFS-17575) SaslDataTransferClient should use SaslParticipant to create messages
[ https://issues.apache.org/jira/browse/HDFS-17575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsz-wo Sze resolved HDFS-17575. --- Hadoop Flags: Reviewed Resolution: Fixed The pull request #6954 is now merged. > SaslDataTransferClient should use SaslParticipant to create messages > > > Key: HDFS-17575 > URL: https://issues.apache.org/jira/browse/HDFS-17575 > Project: Hadoop HDFS > Issue Type: Improvement > Components: security >Reporter: Tsz-wo Sze >Assignee: Tsz-wo Sze >Priority: Major > Labels: pull-request-available > Fix For: 3.5.0 > > > Currently, a SaslDataTransferClient may send a message without using its > SaslParticipant as below. {code} > sendSaslMessage(out, new byte[0]); > {code} > Instead, it should use its SaslParticipant to create the response. > {code} > byte[] localResponse = sasl.evaluateChallengeOrResponse(remoteResponse); > sendSaslMessage(out, localResponse); > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
[jira] [Created] (HDFS-17599) Fix the mismatch between locations and indices for mover
Tao Li created HDFS-17599: - Summary: Fix the mismatch between locations and indices for mover Key: HDFS-17599 URL: https://issues.apache.org/jira/browse/HDFS-17599 Project: Hadoop HDFS Issue Type: Bug Affects Versions: 3.4.0, 3.3.0 Reporter: Tao Li Assignee: Tao Li Attachments: image-2024-08-03-17-59-08-059.png, image-2024-08-03-18-00-01-950.png

We set the EC policy to (6+3) and also have nodes in the ENTERING_MAINTENANCE state. When we moved the data of some directories from SSD to HDD, some blocks failed to move because the target disks were full, as shown in the figure below (blk_-9223372033441574269). We tried the move again and got the error "Replica does not exist". From the fsck information, it can be seen that the wrong block ID (blk_-9223372033441574270) was used when moving the block.

{*}Mover Logs{*}: !image-2024-08-03-17-59-08-059.png|width=741,height=85!

{*}FSCK Info{*}: !image-2024-08-03-18-00-01-950.png|width=738,height=120!

{*}Root Cause{*}: Similar to HDFS-16333, when the mover is initialized, only `LIVE` nodes are processed. As a result, a datanode in the `ENTERING_MAINTENANCE` state is filtered out of the locations, but the indices are not adjusted accordingly, resulting in a mismatch between the lengths of locations and indices. The EC block then computes the wrong block ID when getting the internal block (see `DBlockStriped#getInternalBlock`). We added debug logs, and a few key messages are shown below. The result is an incorrect correspondence: xx.xx.7.31 -> -9223372033441574270.
{code:java}
DBlock getInternalBlock(StorageGroup storage) {
  // storage == xx.xx.7.31
  // idxInLocs == 1 (locations is [xx.xx.85.29:DISK, xx.xx.7.31:DISK,
  // xx.xx.207.22:DISK, xx.xx.8.25:DISK, xx.xx.79.30:DISK, xx.xx.87.21:DISK,
  // xx.xx.8.38:DISK]; xx.xx.179.31, which is in the ENTERING_MAINTENANCE
  // state, has been filtered out)
  int idxInLocs = locations.indexOf(storage);
  if (idxInLocs == -1) {
    return null;
  }
  // idxInGroup == 2 (indices is [1,2,3,4,5,6,7,8])
  byte idxInGroup = indices[idxInLocs];
  // blkId: -9223372033441574272 + 2 = -9223372033441574270
  long blkId = getBlock().getBlockId() + idxInGroup;
  long numBytes = getInternalBlockLength(getNumBytes(), cellSize,
      dataBlockNum, idxInGroup);
  Block blk = new Block(getBlock());
  blk.setBlockId(blkId);
  blk.setNumBytes(numBytes);
  DBlock dblk = new DBlock(blk);
  dblk.addLocation(storage);
  return dblk;
}
{code}

{*}Solution{*}: When initializing DBlockStriped, if any location is filtered out, we need to remove the corresponding element in the indices to do the adaptation. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
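The adaptation proposed for HDFS-17599 above can be sketched in plain Java. This is a simplified stand-in, not the actual Mover/Dispatcher code: `filterLive`, the `String` node names, and the `Filtered` holder are all illustrative; the point is only that an indices entry is dropped whenever its location is dropped, so the two arrays stay aligned.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Simplified sketch of the proposed fix: when a location is filtered out
// (e.g. an ENTERING_MAINTENANCE datanode), also drop the matching indices
// entry so locations and indices keep the same length and alignment.
public class StripedIndexFilter {
  static class Filtered {
    final List<String> locations = new ArrayList<>();
    final List<Byte> indices = new ArrayList<>();
  }

  static Filtered filterLive(String[] locations, byte[] indices,
                             Set<String> liveNodes) {
    Filtered out = new Filtered();
    for (int i = 0; i < locations.length; i++) {
      if (liveNodes.contains(locations[i])) { // keep only LIVE nodes
        out.locations.add(locations[i]);
        out.indices.add(indices[i]);          // keep the index aligned
      }
    }
    return out;
  }
}
```

With this, `indices[locations.indexOf(storage)]` in `getInternalBlock` again refers to the correct internal block index.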
[jira] [Resolved] (HDFS-17544) [ARR] The router client rpc protocol PB supports asynchrony.
[ https://issues.apache.org/jira/browse/HDFS-17544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiaoqiao He resolved HDFS-17544. Fix Version/s: HDFS-17531 Hadoop Flags: Reviewed Resolution: Fixed > [ARR] The router client rpc protocol PB supports asynchrony. > > > Key: HDFS-17544 > URL: https://issues.apache.org/jira/browse/HDFS-17544 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: rbf >Reporter: Jian Zhang >Assignee: Jian Zhang >Priority: Major > Labels: pull-request-available > Fix For: HDFS-17531 > > > *Describe* > In order not to affect other modules, the implementation of the router's > asynchronous client RPC protocolPB mainly extends the original protocolPB > classes. The implemented protocolPB classes are as follows: > *RouterClientProtocolTranslatorPB* extends ClientNamenodeProtocolTranslatorPB > *RouterGetUserMappingsProtocolTranslatorPB* extends > GetUserMappingsProtocolClientSideTranslatorPB > *RouterNamenodeProtocolTranslatorPB* extends NamenodeProtocolTranslatorPB > *RouterRefreshUserMappingsProtocolTranslatorPB* extends > RefreshUserMappingsProtocolClientSideTranslatorPB > Then the router's *ConnectionPool* uses the aforementioned protocolPB classes. > The implementation of the asynchronous RPC client mainly builds on > HADOOP-13226, HDFS-10224, etc. > {*}AsyncRpcProtocolPBUtil{*}: makes the implementation of the asynchronous > RPC protocol more concise and clear. > > *Test* > New UTs: > 1. TestAsyncRpcProtocolPBUtil > 2. TestRouterClientSideTranslatorPB -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
[jira] [Created] (HDFS-17598) Optimizations for DatanodeManager for large-scale cases
Hao-Nan Zhu created HDFS-17598: -- Summary: Optimizations for DatanodeManager for large-scale cases Key: HDFS-17598 URL: https://issues.apache.org/jira/browse/HDFS-17598 Project: Hadoop HDFS Issue Type: Improvement Components: performance Affects Versions: 3.4.0 Reporter: Hao-Nan Zhu

Hello, I wonder whether there are opportunities to optimize {_}DatanodeManager{_} a bit, for its performance when the number of _datanodes_ is large:
* [_fetchDatanodes_|https://github.com/naver/hadoop/blob/0c0a80f96283b5a7be234663e815bc04bafc8be2/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeManager.java#L1144] calls [_removeDecomNodeFromList_|https://github.com/naver/hadoop/blob/0c0a80f96283b5a7be234663e815bc04bafc8be2/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeManager.java#L817] for both the live and the dead datanode lists. _removeDecomNodeFromList_ has to iterate over all datanodes in the list. This could be avoided by checking whether the node is decommissioned, using _node.isDecommissioned()_, before adding the node to the lists of live and dead datanodes.
* [_getNumLiveDataNodes_|https://github.com/naver/hadoop/blob/0c0a80f96283b5a7be234663e815bc04bafc8be2/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeManager.java#L1055] iterates over all datanodes. However, [_getNumDeadDataNodes_|https://github.com/naver/hadoop/blob/0c0a80f96283b5a7be234663e815bc04bafc8be2/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeManager.java#L1068] gets the size in a different (presumably more efficient) way. Is there a reason that _getNumLiveDataNodes_ has to iterate over the entire {_}datanodeMap{_}?
Can we use the same approach for _getNumLiveDataNodes_? Similar observations apply to [_resetLastCachingDirectiveSentTime_|https://github.com/naver/hadoop/blob/0c0a80f96283b5a7be234663e815bc04bafc8be2/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeManager.java#L1560] and [_getDatanodeListForReport_|https://github.com/naver/hadoop/blob/0c0a80f96283b5a7be234663e815bc04bafc8be2/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeManager.java#L1253]. It seems optimizing these methods could make these checks noticeably more performant, especially when the number of datanodes is large. Are there any plans for these kinds of large-scale (micro) optimizations? Please let me know if I need to provide more information. Thanks! -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
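For the _getNumLiveDataNodes_ point above, one common alternative to a full scan is to maintain the count incrementally as nodes change state. A minimal sketch under that assumption (the class and method names are illustrative, not the actual `DatanodeManager` fields or API):

```java
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative sketch only: keep a live-node counter updated on state
// transitions so the query becomes O(1) instead of iterating the whole
// datanode map on every call.
public class LiveNodeCounter {
  private final AtomicInteger liveNodes = new AtomicInteger();

  void onNodeBecameLive() { liveNodes.incrementAndGet(); } // heartbeat registered
  void onNodeBecameDead() { liveNodes.decrementAndGet(); } // marked dead

  int getNumLiveDataNodes() {
    return liveNodes.get(); // no iteration over the datanode map
  }
}
```

The trade-off is that every state transition must go through the counter; the existing scan is simpler but costs O(#datanodes) per call.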
[jira] [Resolved] (HDFS-14883) NPE when the second SNN is starting
[ https://issues.apache.org/jira/browse/HDFS-14883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Li resolved HDFS-14883. --- Resolution: Duplicate > NPE when the second SNN is starting > --- > > Key: HDFS-14883 > URL: https://issues.apache.org/jira/browse/HDFS-14883 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 3.1.1 >Reporter: Ranith Sardar >Assignee: Ranith Sardar >Priority: Major > Labels: multi-sbnn > Fix For: 3.4.0, 3.3.1, 2.10.1, 3.2.2 > > > > {{| WARN | qtp79782883-47 | /imagetransfer | ServletHandler.java:632 > java.io.IOException: PutImage failed. java.lang.NullPointerException > at > org.apache.hadoop.hdfs.server.namenode.ImageServlet.validateRequest(ImageServlet.java:198) > at > org.apache.hadoop.hdfs.server.namenode.ImageServlet.doPut(ImageServlet.java:485) > at javax.servlet.http.HttpServlet.service(HttpServlet.java:710) > at javax.servlet.http.HttpServlet.service(HttpServlet.java:790) > at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:848) > at > org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1772) > at > org.apache.hadoop.security.authentication.server.AuthenticationFilter.doFilter(AuthenticationFilter.java:644)}} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
[jira] [Created] (HDFS-17597) [ARR] RouterSnapshot supports asynchronous rpc.
Jian Zhang created HDFS-17597: - Summary: [ARR] RouterSnapshot supports asynchronous rpc. Key: HDFS-17597 URL: https://issues.apache.org/jira/browse/HDFS-17597 Project: Hadoop HDFS Issue Type: Sub-task Reporter: Jian Zhang *Describe* The main new addition is RouterAsyncSnapshot, which extends RouterSnapshot so that it supports asynchronous RPC. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
[jira] [Created] (HDFS-17596) [ARR] RouterStoragePolicy supports asynchronous rpc.
Jian Zhang created HDFS-17596: - Summary: [ARR] RouterStoragePolicy supports asynchronous rpc. Key: HDFS-17596 URL: https://issues.apache.org/jira/browse/HDFS-17596 Project: Hadoop HDFS Issue Type: Sub-task Reporter: Jian Zhang *Describe* The main new addition is RouterAsyncStoragePolicy, which extends RouterStoragePolicy so that it supports asynchronous RPC. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
[jira] [Created] (HDFS-17595) [ARR] ErasureCoding supports asynchronous rpc.
Jian Zhang created HDFS-17595: - Summary: [ARR] ErasureCoding supports asynchronous rpc. Key: HDFS-17595 URL: https://issues.apache.org/jira/browse/HDFS-17595 Project: Hadoop HDFS Issue Type: Sub-task Reporter: Jian Zhang -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
[jira] [Created] (HDFS-17594) [ARR] RouterCacheAdmin supports asynchronous rpc.
Jian Zhang created HDFS-17594: - Summary: [ARR] RouterCacheAdmin supports asynchronous rpc. Key: HDFS-17594 URL: https://issues.apache.org/jira/browse/HDFS-17594 Project: Hadoop HDFS Issue Type: Sub-task Reporter: Jian Zhang -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
[jira] [Created] (HDFS-17593) Allow setting block locations when opening streams
Csaba Ringhofer created HDFS-17593: -- Summary: Allow setting block locations when opening streams Key: HDFS-17593 URL: https://issues.apache.org/jira/browse/HDFS-17593 Project: Hadoop HDFS Issue Type: Improvement Reporter: Csaba Ringhofer The HDFS client seems to always get block locations from the namenode when opening a file: https://github.com/apache/hadoop/blob/4525c7e35ea22d7a6350b8af10eb8d2ff68376e7/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DFSClient.java#L1099 This leads to unnecessary RPCs in Apache Impala when doing remote reads, as the block locations are cached globally and the executors already have a good guess about the block locations when opening a stream. Unless the cached block locations are stale, ideally no RPC should be made to the namenode. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
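The behaviour requested in HDFS-17593 can be modelled abstractly: consult the caller-supplied cached locations first, and fall back to a namenode RPC only when they are missing or known stale. This is a hypothetical sketch of the control flow, not an existing DFSClient API; the `Supplier` stands in for the `getBlockLocations` RPC and plain `String`s for the location type.

```java
import java.util.List;
import java.util.function.Supplier;

// Hypothetical sketch: prefer the caller's cached block locations and only
// fall back to the namenode (modelled as a Supplier) when the cache is
// absent or flagged as stale.
public class CachedLocationResolver {
  static List<String> resolve(List<String> cachedLocations, boolean stale,
                              Supplier<List<String>> namenodeRpc) {
    if (cachedLocations != null && !stale) {
      return cachedLocations; // no RPC needed
    }
    return namenodeRpc.get(); // fall back to the namenode
  }
}
```

A real implementation would additionally need to detect staleness lazily, e.g. retry via the namenode when a read against a cached location fails.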
[jira] [Created] (HDFS-17592) FastCopy support data copy in different nameservices without federation
liuguanghua created HDFS-17592: -- Summary: FastCopy support data copy in different nameservices without federation Key: HDFS-17592 URL: https://issues.apache.org/jira/browse/HDFS-17592 Project: Hadoop HDFS Issue Type: Sub-task Reporter: liuguanghua FastCopy is a faster data copy tool. In a federated cluster or a single cluster, FastCopy copies blocks via hardlinks, which is much faster than a regular copy. FastCopy can also support data copy via transfer between different nameservices without federation. In theory, it could save almost half the time compared to the original copy. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
[jira] [Created] (HDFS-17591) RBF: Router should follow X-FRAME-OPTIONS protection setting
Takanobu Asanuma created HDFS-17591: --- Summary: RBF: Router should follow X-FRAME-OPTIONS protection setting Key: HDFS-17591 URL: https://issues.apache.org/jira/browse/HDFS-17591 Project: Hadoop HDFS Issue Type: Task Reporter: Takanobu Asanuma Assignee: Takanobu Asanuma Router UI doesn't have X-FRAME-OPTIONS in its header. Router should load the value of dfs.xframe.value. This issue is reported by Daiki Mashima. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
[jira] [Created] (HDFS-17590) `NullPointerException` triggered in `createBlockReader` during retry iteration
Elmer J Fudd created HDFS-17590: --- Summary: `NullPointerException` triggered in `createBlockReader` during retry iteration Key: HDFS-17590 URL: https://issues.apache.org/jira/browse/HDFS-17590 Project: Hadoop HDFS Issue Type: Bug Affects Versions: 3.4.0 Reporter: Elmer J Fudd

While reading blocks of data using `DFSInputStream` in `createBlockReader`, an `IOException` originating from `getBlockAt()` that triggers a retry iteration results in a `NullPointerException` when passing `dnInfo` to `addToLocalDeadNodes` in the catch block. This is the relevant callstack portion from our logs (from 3.4.0, but this was also occurring with "trunk" versions as recent as late June, which 3.4.1 builds upon):

{noformat}
...
java.lang.NullPointerException
        at java.base/java.util.concurrent.ConcurrentHashMap.putVal(ConcurrentHashMap.java:1011)
        at java.base/java.util.concurrent.ConcurrentHashMap.put(ConcurrentHashMap.java:1006)
        at org.apache.hadoop.hdfs.DFSInputStream.addToLocalDeadNodes(DFSInputStream.java:184)
        at org.apache.hadoop.hdfs.DFSStripedInputStream.createBlockReader(DFSStripedInputStream.java:279)
        at org.apache.hadoop.hdfs.StripeReader.readChunk(StripeReader.java:304)
        at org.apache.hadoop.hdfs.StripeReader.readStripe(StripeReader.java:335)
        at org.apache.hadoop.hdfs.DFSStripedInputStream.fetchBlockByteRange(DFSStripedInputStream.java:504)
        at org.apache.hadoop.hdfs.DFSInputStream.pread(DFSInputStream.java:1472)
        at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:1436)
        at org.apache.hadoop.fs.FSInputStream.readFully(FSInputStream.java:124)
        at org.apache.hadoop.fs.FSDataInputStream.readFully(FSDataInputStream.java:119)
...
{noformat}

What we observe is that `getBlockAt()` throws an `IOException` at this check:

{code:java}
// check offset
if (offset < 0 || offset >= getFileLength()) {
  throw new IOException("offset < 0 || offset >= getFileLength(), offset="
      + offset + ", locatedBlocks=" + locatedBlocks);
}
{code}

This is eventually caught in `createBlockReader`.
The catch block attempts to handle the error and, as part of the error handling, invokes the `addToLocalDeadNodes` method. The `dnInfo` object passed to this method is null, as it was never fully assigned [here|https://github.com/apache/hadoop/blob/4525c7e35ea22d7a6350b8af10eb8d2ff68376e7/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DFSStripedInputStream.java#L247], which results in a `NullPointerException`. To sum up, this is the failure path according to the logs:
# `IOException` is thrown in `getBlockAt` ([code|https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DFSInputStream.java#L479])
# The exception propagates to `getBlockGroupAt` ([code|https://github.com/apache/hadoop/blob/4525c7e35ea22d7a6350b8af10eb8d2ff68376e7/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DFSStripedInputStream.java#L476])
# It further propagates to `refreshLocatedBlock` ([code|https://github.com/apache/hadoop/blob/4525c7e35ea22d7a6350b8af10eb8d2ff68376e7/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DFSStripedInputStream.java#L459])
# The `IOException` is caught in `createBlockReader` ([code|https://github.com/apache/hadoop/blob/4525c7e35ea22d7a6350b8af10eb8d2ff68376e7/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DFSStripedInputStream.java#L247])
# Error handling in the catch block of `createBlockReader` invokes `addToLocalDeadNodes` ([code|https://github.com/apache/hadoop/blob/4525c7e35ea22d7a6350b8af10eb8d2ff68376e7/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DFSStripedInputStream.java#L281])
# Execution throws a `NullPointerException` since `dnInfo` is null
A simple fix, adding a null check so that only non-null `dnInfo` objects are added to the hash map, and similarly adjusting the log messages in the `catch` block, should solve the issue.
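The null-check fix suggested above might look roughly like the following. This is a simplified sketch with stand-in types (a `String` instead of `DatanodeInfo`, a plain holder class instead of `DFSInputStream`), not the actual patch:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of the proposed guard: ConcurrentHashMap rejects null keys (the
// exact NPE in the stack trace above), so only record the datanode when
// dnInfo is non-null.
public class DeadNodeTracker {
  private final Map<String, Boolean> localDeadNodes = new ConcurrentHashMap<>();

  void addToLocalDeadNodes(String dnInfo) {
    if (dnInfo != null) { // guard: the reader failed before a DN was chosen
      localDeadNodes.put(dnInfo, Boolean.TRUE);
    }
    // else: log that the block reader failed without a selected datanode
  }

  int size() {
    return localDeadNodes.size();
  }
}
```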
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
[jira] [Resolved] (HDFS-16690) Automatically format new unformatted JournalNodes using JournalNodeSyncer
[ https://issues.apache.org/jira/browse/HDFS-16690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiaoqiao He resolved HDFS-16690. Fix Version/s: 3.5.0 Hadoop Flags: Reviewed Resolution: Fixed > Automatically format new unformatted JournalNodes using JournalNodeSyncer > -- > > Key: HDFS-16690 > URL: https://issues.apache.org/jira/browse/HDFS-16690 > Project: Hadoop HDFS > Issue Type: Improvement > Components: journal-node >Affects Versions: 3.4.0, 3.3.5 > Environment: Demonstrated in a Kubernetes environment running Java 11. > # Start new cluster, but short 1 JN (minimum quorum, and the missing JN > won’t resolve). VERIFY: > - NN formats the 2 existing JN and stabilizes. NOTE: Formatting using just > a quorum will be a separate submission > - Messages show sync between JN-0 and JN-1, and NN -> JN. > # Scale JN stateful set to add missing JN. VERIFY: > - New JN starts > - All other JN and all NN report IP address change (IP Address resolution). > NOTE: require HADOOP-18365 and HDFS-16688 > - Messages show sync between all JN, and NN -> JN > - New JN is formatted at least once (possibly by multiple other JN) > - New JN storage directory is formatted only once > - New JN joins cluster (lastWriterEpoch is non-zero) >Reporter: Steve Vaughan >Assignee: Aswin M Prabhu >Priority: Major > Labels: pull-request-available > Fix For: 3.5.0 > > > If an unformatted JournalNode is added to an existing JournalNode set, > instances of the JournalNodeSyncer are unable to sync to the new node. When > a sync receives a JournalNotFormattedException, we can initiate a format > operation, and then retry the synchronization. > Conceptually this means that the JournalNodes and their data can be managed > independently from the rest of the system, as the JournalNodes will > incorporate new JournalNode instances. Once the new JournalNode is > formatted, it can participate in shared edits from the NameNodes. 
> I've been testing an update to the InterQJournalProtocol to add a format call > like that used by the NameNode. Current tests include starting an HA cluster > from scratch, but with 2 JournalNode instances. Once the cluster is up, I > can add the 3rd JournalNode (which is unformatted), and the other 2 > JournalNodes will eventually attempt to sync which results in a formatting > and subsequent sync. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
[jira] [Created] (HDFS-17589) HDFS EC: old block not deleted after block reconstruction
ruiliang created HDFS-17589: --- Summary: HDFS EC: old block not deleted after block reconstruction Key: HDFS-17589 URL: https://issues.apache.org/jira/browse/HDFS-17589 Project: Hadoop HDFS Issue Type: Bug Affects Versions: 3.1.1 Reporter: ruiliang

The cluster was faulty earlier: datanodes kept losing their connections and recovering, which triggered a lot of EC data reconstruction, but many of the old blocks failed to be cleaned up correctly. Has this already been fixed? Which patch do I need to apply? Thank you. The detailed check log follows.

{code:java}
datanode delete data ec blk ?
grep blk_-9223372036371044656 hadoop-hdfs-root-datanode-fs-hiido-dn-12-66-111.hiido.host.xxx.com.log
2024-07-18 17:25:07,879 INFO datanode.DataNode (DataXceiver.java:writeBlock(738)) - Receiving BP-1822992414-10.12.65.48-1660893388633:blk_-9223372036371044656_1688858793 src: /10.12.66.111:25066 dest: /10.12.66.111:1019
2024-07-18 17:25:17,396 INFO datanode.DataNode (StripedBlockReconstructor.java:run(86)) - ok EC reconstruct striped block: BP-1822992414-10.12.65.48-1660893388633:blk_-9223372036371044656_1688858793 blockId: -9223372036371044656
2024-07-18 17:25:17,396 INFO datanode.DataNode (DataXceiver.java:writeBlock(914)) - Received BP-1822992414-10.12.65.48-1660893388633:blk_-9223372036371044656_1688858793 src: /10.12.66.111:25066 dest: /10.12.66.111:1019 of size 193986560
2024-07-18 17:25:25,465 INFO impl.FsDatasetAsyncDiskService (FsDatasetAsyncDiskService.java:deleteAsync(225)) - Scheduling blk_-9223372036371044656_1688858793 replica FinalizedReplica, blk_-9223372036371044656_1688858793, FINALIZED getBlockURI() = file:/data4/hadoop/dfs/data/current/BP-1822992414-10.12.65.48-1660893388633/current/finalized/subdir21/subdir6/blk_-9223372036371044656 for deletion
2024-07-18 17:25:25,746 INFO impl.FsDatasetAsyncDiskService (FsDatasetAsyncDiskService.java:run(333)) - Deleted BP-1822992414-10.12.65.48-1660893388633 blk_-9223372036371044656_1688858793 URI file:/data4/hadoop/dfs/data/current/BP-1822992414-10.12.65.48-1660893388633/current/finalized/subdir21/subdir6/blk_-9223372036371044656

my config dfs.blockreport.intervalMsec =2160

namenode3 log
hadoop-hdfs-namenode-fs-hiido-yycluster06-yynn3.hiido.host.xxxyy.com.log.1:2024-07-18 04:34:39,523 WARN BlockStateChange (BlockManager.java:addStoredBlock(3238)) - BLOCK* addStoredBlock: block blk_-9223372036371044656_1688858793 moved to storageType DISK on node 10.12.66.154:1019
hadoop-hdfs-namenode-fs-hiido-yycluster06-yynn3.hiido.host.xxxyy.com.log.1:2024-07-18 04:34:40,131 WARN BlockStateChange (BlockManager.java:addStoredBlock(3238)) - BLOCK* addStoredBlock: block blk_-9223372036371044656_1688858793 moved to storageType DISK on node 10.12.66.154:1019
hadoop-hdfs-namenode-fs-hiido-yycluster06-yynn3.hiido.host.xxxyy.com.log.1:2024-07-18 10:34:38,950 WARN BlockStateChange (BlockManager.java:addStoredBlock(3238)) - BLOCK* addStoredBlock: block blk_-9223372036371044656_1688858793 moved to storageType DISK on node 10.12.66.154:1019
hadoop-hdfs-namenode-fs-hiido-yycluster06-yynn3.hiido.host.xxxyy.com.log.1:2024-07-18 10:34:39,559 WARN BlockStateChange (BlockManager.java:addStoredBlock(3238)) - BLOCK* addStoredBlock: block blk_-9223372036371044656_1688858793 moved to storageType DISK on node 10.12.66.154:1019
hadoop-hdfs-namenode-fs-hiido-yycluster06-yynn3.hiido.host.xxxyy.com.log:2024-07-18 16:34:38,564 WARN BlockStateChange (BlockManager.java:addStoredBlock(3238)) - BLOCK* addStoredBlock: block blk_-9223372036371044656_1688858793 moved to storageType DISK on node 10.12.66.154:1019
hadoop-hdfs-namenode-fs-hiido-yycluster06-yynn3.hiido.host.xxxyy.com.log:2024-07-18 16:34:39,190 WARN BlockStateChange (BlockManager.java:addStoredBlock(3238)) - BLOCK* addStoredBlock: block blk_-9223372036371044656_1688858793 moved to storageType DISK on node 10.12.66.154:1019
hadoop-hdfs-namenode-fs-hiido-yycluster06-yynn3.hiido.host.xxxyy.com.log.1:2024-07-17 04:34:39,462 WARN BlockStateChange (BlockManager.java:addStoredBlock(3238)) - BLOCK* addStoredBlock: block blk_-9223372036371044656_1688858793 moved to storageType DISK on node 10.12.66.154:1019
hadoop-hdfs-namenode-fs-hiido-yycluster06-yynn3.hiido.host.xxxyy.com.log.1:2024-07-17 04:34:40,083 WARN BlockStateChange (BlockManager.java:addStoredBlock(3238)) - BLOCK* addStoredBlock: block blk_-9223372036371044656_1688858793 moved to storageType DISK on node 10.12.66.154:1019
hadoop-hdfs-namenode-fs-hiido-yycluster06-yynn3.hiido.host.xxxyy.com.log.1:2024-07-17 10:34:39,686 WARN BlockStateChange (BlockManager.java:addStoredBlock(3238)) - BLOCK* addStoredBlock: block blk_-9223372036371044656_1688858793 moved to storageType DISK on node 10.12.66.154:1019
hadoop-hdfs-namenode-fs-hiido-yycluster06-yynn3.hiido.host.xxxyy.com.log.1:2024-07-17
{code}
[jira] [Created] (HDFS-17588) RBF: Clients using RouterObserverReadProxyProvider should first perform msync.
fuchaohong created HDFS-17588: - Summary: RBF: Clients using RouterObserverReadProxyProvider should first perform msync. Key: HDFS-17588 URL: https://issues.apache.org/jira/browse/HDFS-17588 Project: Hadoop HDFS Issue Type: Improvement Components: rbf Reporter: fuchaohong When using RouterObserverReadProxyProvider to initiate the first RPC request, the router routes this RPC to the active namenode and updates the stateid of the corresponding nameservice. However, the stateid of other nameservices is not updated. Clients should first perform msync to update the stateid of all nameservices with enabled Observers. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
[jira] [Created] (HDFS-17587) "StorageTypeStats" Metric should not include decommissioned/in-maintenance nodes
Mohamed Aashif created HDFS-17587: - Summary: "StorageTypeStats" Metric should not include decommissioned/in-maintenance nodes Key: HDFS-17587 URL: https://issues.apache.org/jira/browse/HDFS-17587 Project: Hadoop HDFS Issue Type: Bug Components: namenode Reporter: Mohamed Aashif -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
[jira] [Reopened] (HDFS-17575) SaslDataTransferClient should use SaslParticipant to create messages
[ https://issues.apache.org/jira/browse/HDFS-17575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsz-wo Sze reopened HDFS-17575: --- The [pull request 6933 |https://github.com/apache/hadoop/pull/6933] has caused test failure. Reverted it. > SaslDataTransferClient should use SaslParticipant to create messages > > > Key: HDFS-17575 > URL: https://issues.apache.org/jira/browse/HDFS-17575 > Project: Hadoop HDFS > Issue Type: Improvement > Components: security >Reporter: Tsz-wo Sze >Assignee: Tsz-wo Sze >Priority: Major > Labels: pull-request-available > Fix For: 3.5.0 > > > Currently, a SaslDataTransferClient may send a message without using its > SaslParticipant as below. {code} > sendSaslMessage(out, new byte[0]); > {code} > Instead, it should use its SaslParticipant to create the response. > {code} > byte[] localResponse = sasl.evaluateChallengeOrResponse(remoteResponse); > sendSaslMessage(out, localResponse); > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
[jira] [Resolved] (HDFS-17576) Support user defined auth Callback
[ https://issues.apache.org/jira/browse/HDFS-17576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsz-wo Sze resolved HDFS-17576. --- Fix Version/s: 3.3.7 Hadoop Flags: Reviewed Resolution: Fixed The pull request is now merged. > Support user defined auth Callback > -- > > Key: HDFS-17576 > URL: https://issues.apache.org/jira/browse/HDFS-17576 > Project: Hadoop HDFS > Issue Type: Improvement > Components: security >Reporter: Tsz-wo Sze >Assignee: Tsz-wo Sze >Priority: Major > Labels: pull-request-available > Fix For: 3.3.7 > > > Some security provider may define a new > javax.security.auth.callback.Callback. This JIRA is to allow users to > configure a customized callback handler in such case. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
[jira] [Resolved] (HDFS-17575) SaslDataTransferClient should use SaslParticipant to create messages
[ https://issues.apache.org/jira/browse/HDFS-17575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsz-wo Sze resolved HDFS-17575. --- Fix Version/s: 3.3.7 Resolution: Fixed The pull request is now merged. > SaslDataTransferClient should use SaslParticipant to create messages > > > Key: HDFS-17575 > URL: https://issues.apache.org/jira/browse/HDFS-17575 > Project: Hadoop HDFS > Issue Type: Improvement > Components: security >Reporter: Tsz-wo Sze >Assignee: Tsz-wo Sze >Priority: Major > Labels: pull-request-available > Fix For: 3.3.7 > > > Currently, a SaslDataTransferClient may send a message without using its > SaslParticipant as below. {code} > sendSaslMessage(out, new byte[0]); > {code} > Instead, it should use its SaslParticipant to create the response. > {code} > byte[] localResponse = sasl.evaluateChallengeOrResponse(remoteResponse); > sendSaslMessage(out, localResponse); > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
[jira] [Created] (HDFS-17586) Fix timestamp in rbfbalance tool.
Zhaobo Huang created HDFS-17586: --- Summary: Fix timestamp in rbfbalance tool. Key: HDFS-17586 URL: https://issues.apache.org/jira/browse/HDFS-17586 Project: Hadoop HDFS Issue Type: Bug Reporter: Zhaobo Huang When the 'Federation Balance' Tool calls the 'DistCp' tool, the timestamp is not retained. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
[jira] [Created] (HDFS-17584) DistributedFileSystem verifyChecksum should be configurable
Liangjun He created HDFS-17584: -- Summary: DistributedFileSystem verifyChecksum should be configurable Key: HDFS-17584 URL: https://issues.apache.org/jira/browse/HDFS-17584 Project: Hadoop HDFS Issue Type: Improvement Components: hdfs-client Reporter: Liangjun He Assignee: Liangjun He In some of our POC scenarios, we would like to set the verifyChecksum of DistributedFileSystem to false, but currently, verifyChecksum is not configurable and the default value is true. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
[jira] [Created] (HDFS-17583) Support auto-refresh of latest viewDFS mount configuration
Palakur Eshwitha Sai created HDFS-17583: --- Summary: Support auto-refresh of latest viewDFS mount configuration Key: HDFS-17583 URL: https://issues.apache.org/jira/browse/HDFS-17583 Project: Hadoop HDFS Issue Type: Improvement Reporter: Palakur Eshwitha Sai Assignee: Palakur Eshwitha Sai Currently, the central mount table configuration feature for viewDFS loads the latest mounts each time the viewDFS initialize function is called. But in the Hive use case, Hive calls initialize on the filesystem only once after an HS2 restart; from then on, it fetches the filesystem and resolved mount points from the cache. This requires an HS2 restart each time some data is moved to S3 or other Hadoop-compatible file systems and a new mount-table.xml file is added to the viewDFS central mount config directory, which is not ideal. We should implement a way in which the mount table is auto-loaded at specific intervals, or each time the central mount-table directory is updated with a new mount-table.xml file. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
[jira] [Created] (HDFS-17582) Distcp support fastcopy
liuguanghua created HDFS-17582: -- Summary: Distcp support fastcopy Key: HDFS-17582 URL: https://issues.apache.org/jira/browse/HDFS-17582 Project: Hadoop HDFS Issue Type: Sub-task Components: hdfs Reporter: liuguanghua Assignee: liuguanghua DistCp should support fastcopy for distributed data replication across the same nameservice or different nameservices in an HDFS federation cluster. This depends on # HDFS-16757 # HDFS-17581 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
[jira] [Created] (HDFS-17581) Add FastCopy tool and support dfs -fastcp command
liuguanghua created HDFS-17581: -- Summary: Add FastCopy tool and support dfs -fastcp command Key: HDFS-17581 URL: https://issues.apache.org/jira/browse/HDFS-17581 Project: Hadoop HDFS Issue Type: Sub-task Components: hdfs Reporter: liuguanghua Assignee: liuguanghua Add a FastCopy tool: (1) support data replication for replicated files (2) support data replication for EC files. Also add an hdfs dfs -fastcp command for copying files using fastcopy; the fastcp command is similar to the cp command. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
[jira] [Resolved] (HDFS-17580) Change the default value of dfs.datanode.lock.fair to false due to potential hang
[ https://issues.apache.org/jira/browse/HDFS-17580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] farmmamba resolved HDFS-17580. -- Resolution: Won't Fix Changing the value to false alone cannot prevent the datanode hang; we need some other methods. > Change the default value of dfs.datanode.lock.fair to false due to potential > hang > - > > Key: HDFS-17580 > URL: https://issues.apache.org/jira/browse/HDFS-17580 > Project: Hadoop HDFS > Issue Type: Improvement > Components: datanode >Affects Versions: 3.4.0 >Reporter: farmmamba >Assignee: farmmamba >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
[jira] [Created] (HDFS-17580) Change the default value of dfs.datanode.lock.fair to false
farmmamba created HDFS-17580: Summary: Change the default value of dfs.datanode.lock.fair to false Key: HDFS-17580 URL: https://issues.apache.org/jira/browse/HDFS-17580 Project: Hadoop HDFS Issue Type: Improvement Components: datanode Affects Versions: 3.4.0 Reporter: farmmamba Assignee: farmmamba -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
[jira] [Created] (HDFS-17579) [DatanodeAdminDefaultMonitor] Better Comments Explaining Why Blocks Need Reconstruction May Not Block Decommission/Maintenance
wuchang created HDFS-17579: -- Summary: [DatanodeAdminDefaultMonitor] Better Comments Explaining Why Blocks Need Reconstruction May Not Block Decommission/Maintenance Key: HDFS-17579 URL: https://issues.apache.org/jira/browse/HDFS-17579 Project: Hadoop HDFS Issue Type: Improvement Components: dfsadmin Reporter: wuchang -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
[jira] [Resolved] (HDFS-17566) Got wrong sorted block order when StorageType is considered.
[ https://issues.apache.org/jira/browse/HDFS-17566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiaoqiao He resolved HDFS-17566. Fix Version/s: 3.5.0 Hadoop Flags: Reviewed Resolution: Fixed > Got wrong sorted block order when StorageType is considered. > > > Key: HDFS-17566 > URL: https://issues.apache.org/jira/browse/HDFS-17566 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Chenyu Zheng >Assignee: Chenyu Zheng >Priority: Major > Labels: pull-request-available > Fix For: 3.5.0 > > > I found unit test failures like below: > ``` > [ERROR] Tests run: 17, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: > 9.146 s <<< FAILURE! - in > org.apache.hadoop.hdfs.server.blockmanagement.TestDatanodeManager > [ERROR] > testGetBlockLocationConsiderStorageType(org.apache.hadoop.hdfs.server.blockmanagement.TestDatanodeManager) > Time elapsed: 0.206 s <<< FAILURE! > org.junit.ComparisonFailure: expected: but was: > at org.junit.Assert.assertEquals(Assert.java:117) > at org.junit.Assert.assertEquals(Assert.java:146) > at > org.apache.hadoop.hdfs.server.blockmanagement.TestDatanodeManager.testGetBlockLocationConsiderStorageType(TestDatanodeManager.java:769) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59) > at > org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) > at > org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56) > at > org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) > at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306) > at > 
org.junit.runners.BlockJUnit4ClassRunner$1.evaluate(BlockJUnit4ClassRunner.java:100) > at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:366) > at > org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:103) > at > org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:63) > at org.junit.runners.ParentRunner$4.run(ParentRunner.java:331) > at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:79) > at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:329) > at org.junit.runners.ParentRunner.access$100(ParentRunner.java:66) > at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:293) > at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306) > at org.junit.runners.ParentRunner.run(ParentRunner.java:413) > at > org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365) > at > org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:273) > at > org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:238) > at > org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:159) > at > org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:384) > at > org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:345) > at > org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:126) > at > org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:418) > ``` > > The reason is that in HDFS-17098 comparator order is wrong! -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
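[Editor's note] The resolution above attributes the failure to a wrong comparator order introduced in HDFS-17098. The toy sketch below (plain Java; the class Loc and the fields decommissioned/storageRank are invented for illustration and are not the actual DatanodeManager code) shows why the composition order of two comparators changes the resulting block-location order:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Toy illustration only: sorting replicas so that decommissioned nodes
// come last must apply the node-state comparison BEFORE the storage-type
// comparison; swapping the composition order gives a different result.
public class ComparatorOrderDemo {
    static class Loc {
        final String node;
        final boolean decommissioned; // decommissioned nodes should sort last
        final int storageRank;        // lower rank = "faster" storage
        Loc(String node, boolean decommissioned, int storageRank) {
            this.node = node;
            this.decommissioned = decommissioned;
            this.storageRank = storageRank;
        }
    }

    static final List<Loc> LOCS = Arrays.asList(
        new Loc("dn1", true, 0),   // decommissioned, fast storage
        new Loc("dn2", false, 1)); // live, slower storage

    static List<String> sortedBy(Comparator<Loc> c) {
        List<Loc> copy = new ArrayList<>(LOCS);
        copy.sort(c);
        List<String> names = new ArrayList<>();
        for (Loc l : copy) names.add(l.node);
        return names;
    }

    // Node state first, then storage: live dn2 correctly sorts first.
    static List<String> stateFirst() {
        return sortedBy(Comparator.<Loc, Boolean>comparing(l -> l.decommissioned)
            .thenComparingInt(l -> l.storageRank));
    }

    // Storage first, then node state: decommissioned dn1 wrongly sorts first.
    static List<String> storageFirst() {
        return sortedBy(Comparator.<Loc>comparingInt(l -> l.storageRank)
            .thenComparing(l -> l.decommissioned));
    }
}
```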
[jira] [Created] (HDFS-17578) ShellCommandFencer#setConfAsEnvVars should also replace '-' with '_'.
fuchaohong created HDFS-17578: - Summary: ShellCommandFencer#setConfAsEnvVars should also replace '-' with '_'. Key: HDFS-17578 URL: https://issues.apache.org/jira/browse/HDFS-17578 Project: Hadoop HDFS Issue Type: Improvement Reporter: fuchaohong When setting configuration values as environment variables, '-' should also be replaced with '_'. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
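[Editor's note] A minimal sketch of the key-name mapping HDFS-17578 proposes (plain Java, not the actual ShellCommandFencer code; the key "dfs.ha.fencing-methods" is used only as an illustrative input):

```java
// Sketch: map a Hadoop configuration key to a shell-safe environment
// variable name by replacing both '.' and '-' with '_', as the JIRA
// proposes (shell variable names cannot contain '.' or '-').
public class EnvVarName {
    static String toEnvVar(String confKey) {
        return confKey.replace('.', '_').replace('-', '_');
    }
}
```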
[jira] [Created] (HDFS-17577) Add Support for CreateFlag.NO_LOCAL_WRITE in File Creation to Manage Disk Space and Network Load in Labeled YARN Nodes
liang yu created HDFS-17577: --- Summary: Add Support for CreateFlag.NO_LOCAL_WRITE in File Creation to Manage Disk Space and Network Load in Labeled YARN Nodes Key: HDFS-17577 URL: https://issues.apache.org/jira/browse/HDFS-17577 Project: Hadoop HDFS Issue Type: New Feature Components: dfsclient Reporter: liang yu {*}Description{*}: I am currently using Apache Flink to write files into Hadoop. The Flink application runs on a labeled YARN queue. During operation, it has been observed that the local disks on these labeled nodes get filled up quickly, and the network load is significantly high. This issue arises because Hadoop prioritizes writing files to the local node first, and the number of these labeled nodes is quite limited. {*}Problem{*}: The current behavior leads to inefficient disk space utilization and high network traffic on these few labeled nodes, which could potentially affect the performance and reliability of the application. {*}Implementation{*}: The implementation would involve adding a configuration _dfs.client.write.no_local_write_ to support the {{CreateFlag.NO_LOCAL_WRITE}} flag during the file creation process in Hadoop's file system APIs. This will provide flexibility to applications like Flink running in labeled queues to opt for non-local writes when necessary. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
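[Editor's note] A hedged sketch of how such a client-side switch could work (plain Java; CreateFlag here is a local stand-in enum, not the real org.apache.hadoop.fs.CreateFlag, and the key name "dfs.client.write.no_local_write" is the one proposed above, not an existing Hadoop property):

```java
import java.util.EnumSet;
import java.util.Properties;

// Illustrative sketch only, not DFSClient code: when the proposed
// client-side switch is on, add NO_LOCAL_WRITE to the create flags so
// the first replica is not placed on the writer's local datanode.
public class NoLocalWriteSketch {
    enum CreateFlag { CREATE, OVERWRITE, NO_LOCAL_WRITE }

    static EnumSet<CreateFlag> createFlags(Properties conf) {
        EnumSet<CreateFlag> flags = EnumSet.of(CreateFlag.CREATE, CreateFlag.OVERWRITE);
        if (Boolean.parseBoolean(
                conf.getProperty("dfs.client.write.no_local_write", "false"))) {
            flags.add(CreateFlag.NO_LOCAL_WRITE);
        }
        return flags;
    }
}
```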
[jira] [Created] (HDFS-17575) SaslDataTransferClient should use SaslParticipant to create messages
Tsz-wo Sze created HDFS-17575: - Summary: SaslDataTransferClient should use SaslParticipant to create messages Key: HDFS-17575 URL: https://issues.apache.org/jira/browse/HDFS-17575 Project: Hadoop HDFS Issue Type: Improvement Components: security Reporter: Tsz-wo Sze Assignee: Tsz-wo Sze Currently, a SaslDataTransferClient may send a message without using its SaslParticipant as below. {code} sendSaslMessage(out, new byte[0]); {code} Instead, it should use its SaslParticipant to create the response. {code} byte[] localResponse = sasl.evaluateChallengeOrResponse(remoteResponse); sendSaslMessage(out, localResponse); {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
[jira] [Resolved] (HDFS-10535) Rename AsyncDistributedFileSystem
[ https://issues.apache.org/jira/browse/HDFS-10535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsz-wo Sze resolved HDFS-10535. --- Resolution: Won't Fix This JIRA became stale. > Rename AsyncDistributedFileSystem > - > > Key: HDFS-10535 > URL: https://issues.apache.org/jira/browse/HDFS-10535 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: hdfs-client >Reporter: Tsz-wo Sze >Assignee: Tsz-wo Sze >Priority: Major > Attachments: h10535_20160616.patch > > > Per discussion in HDFS-9924, AsyncDistributedFileSystem is not a good name > since we only support nonblocking calls for the moment. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
[jira] [Resolved] (HDFS-11948) Ozone: change TestRatisManager to check cluster with data
[ https://issues.apache.org/jira/browse/HDFS-11948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsz-wo Sze resolved HDFS-11948. --- Resolution: Won't Fix This JIRA became stale. > Ozone: change TestRatisManager to check cluster with data > - > > Key: HDFS-11948 > URL: https://issues.apache.org/jira/browse/HDFS-11948 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: ozone >Reporter: Tsz-wo Sze >Assignee: Tsz-wo Sze >Priority: Major > Labels: OzonePostMerge > Attachments: HDFS-11948-HDFS-7240.20170614.patch, > HDFS-11948-HDFS-7240.20170731.patch > > > TestRatisManager first creates multiple Ratis clusters. Then it changes the > membership and closes some clusters. However, it does not test the clusters > with data. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
[jira] [Resolved] (HDFS-11734) Ozone: provide a way to validate ContainerCommandRequestProto
[ https://issues.apache.org/jira/browse/HDFS-11734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsz-wo Sze resolved HDFS-11734. --- Resolution: Won't Fix This JIRA became stale. > Ozone: provide a way to validate ContainerCommandRequestProto > - > > Key: HDFS-11734 > URL: https://issues.apache.org/jira/browse/HDFS-11734 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: ozone >Reporter: Tsz-wo Sze >Assignee: Anu Engineer >Priority: Critical > Labels: OzonePostMerge, tocheck > > We need some API to check if a ContainerCommandRequestProto is valid. > It is useful when the container pipeline is run with Ratis. Then, the leader > could first checks if a ContainerCommandRequestProto is valid before the > request is propagated to the followers. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
[jira] [Resolved] (HDFS-11735) Ozone: In Ratis, leader should validate ContainerCommandRequestProto before propagating it to followers
[ https://issues.apache.org/jira/browse/HDFS-11735?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsz-wo Sze resolved HDFS-11735. --- Resolution: Won't Fix This JIRA became stale. > Ozone: In Ratis, leader should validate ContainerCommandRequestProto before > propagating it to followers > --- > > Key: HDFS-11735 > URL: https://issues.apache.org/jira/browse/HDFS-11735 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: ozone >Reporter: Tsz-wo Sze >Assignee: Tsz-wo Sze >Priority: Major > Labels: OzonePostMerge, tocheck > Attachments: HDFS-11735-HDFS-7240.20170501.patch > > > The leader should use the API provided by HDFS-11734 to check if a > ContainerCommandRequestProto is valid before propagating it to followers. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
[jira] [Created] (HDFS-17574) Make NNThroughputBenchmark support argument blockSize suffix with k, m, g, t, p, e
wangzhongwei created HDFS-17574: --- Summary: Make NNThroughputBenchmark support argument blockSize suffix with k, m, g, t, p, e Key: HDFS-17574 URL: https://issues.apache.org/jira/browse/HDFS-17574 Project: Hadoop HDFS Issue Type: Bug Components: benchmarks, hdfs Affects Versions: 3.3.6, 3.3.3 Reporter: wangzhongwei Assignee: wangzhongwei As of now, we cannot specify data units in the -blockSize argument (like 1m), but have to specify plain numbers. Test command: hadoop org.apache.hadoop.hdfs.server.namenode.NNThroughputBenchmark -fs [hdfs://|hdfs://ctyunns/]x -op create -threads 100 -files 25 -filesPerDir 100 -blockSize 1m -close -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
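[Editor's note] A sketch of the suffix-aware parsing the JIRA asks for (plain Java, not the actual NNThroughputBenchmark patch; Hadoop's StringUtils.TraditionalBinaryPrefix provides similar binary-prefix handling):

```java
// Sketch: parse a -blockSize argument with an optional binary suffix
// (k, m, g, t, p, e => 2^10 .. 2^60); a plain number is returned as-is.
public class BlockSizeArg {
    static long parse(String s) {
        s = s.trim().toLowerCase();
        char last = s.charAt(s.length() - 1);
        int shift = "kmgtpe".indexOf(last) + 1; // 1 => 2^10, 2 => 2^20, ...
        if (shift == 0) {
            return Long.parseLong(s); // no suffix
        }
        return Long.parseLong(s.substring(0, s.length() - 1)) << (10 * shift);
    }
}
```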
[jira] [Resolved] (HDFS-16714) Remove okhttp and kotlin dependencies
[ https://issues.apache.org/jira/browse/HDFS-16714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Pan resolved HDFS-16714. -- Resolution: Duplicate > Remove okhttp and kotlin dependencies > - > > Key: HDFS-16714 > URL: https://issues.apache.org/jira/browse/HDFS-16714 > Project: Hadoop HDFS > Issue Type: Improvement > Components: hdfs-client >Affects Versions: 3.3.4 >Reporter: Cheng Pan >Assignee: Cheng Pan >Priority: Major > Labels: pull-request-available > Time Spent: 1.5h > Remaining Estimate: 0h > > hadoop-common already has apache http client dependencies, okhttp is > unnecessary -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
[jira] [Created] (HDFS-17573) Add test code for FSImage parallelization and compression
Sungdong Kim created HDFS-17573: --- Summary: Add test code for FSImage parallelization and compression Key: HDFS-17573 URL: https://issues.apache.org/jira/browse/HDFS-17573 Project: Hadoop HDFS Issue Type: Improvement Components: hdfs, namenode Affects Versions: 3.4.1 Reporter: Sungdong Kim Fix For: 3.4.1 The feature added in HDFS-14617 (Improve FSImage load time by writing sub-sections to the FSImage index, by [Stephen O'Donnell|https://issues.apache.org/jira/secure/ViewProfile.jspa?name=sodonnell]) makes loading the FSImage much faster. But this option cannot be activated when dfs.image.compress=true is turned on. In my opinion, larger clusters require both settings at the same time. For example, the cluster I'm using has approximately 6 million file system objects, and the FSImage is approximately 11GB with the dfs.image.compress=true setting. If the dfs.image.compress option is turned off, it is expected to exceed 30GB, in which case it will take a long time to transfer the FSImage from the standby to the active namenode, using significant network resources. It was proved in HDFS-16147 (by [kinit|https://issues.apache.org/jira/secure/ViewProfile.jspa?name=mofei]) that parallel FSImage loading and FSImage compression can be turned on at the same time (and this also worked well in my environment). I created this new Jira and PR because the discussion in HDFS-16147 ended in 2021, and I want it to be officially added in the next release instead of remaining patch-available. The actual code of the patch was written by [kinit|https://issues.apache.org/jira/secure/ViewProfile.jspa?name=mofei]; I resolved the empty sub-section problem (see the comment below HDFS-16147) and added test code. If this is not a proper method, please let me know another way to contribute. Thanks. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
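[Editor's note] Assuming the combination lands, the two settings discussed above would be enabled together in hdfs-site.xml. A sketch of the fragment (dfs.image.parallel.load comes from HDFS-14617; GzipCodec is shown only as one possible codec choice, not necessarily the default):

```xml
<!-- Sketch: enable parallel FSImage sub-section loading (HDFS-14617)
     together with image compression, as HDFS-16147 proposes. -->
<property>
  <name>dfs.image.parallel.load</name>
  <value>true</value>
</property>
<property>
  <name>dfs.image.compress</name>
  <value>true</value>
</property>
<property>
  <name>dfs.image.compression.codec</name>
  <value>org.apache.hadoop.io.compress.GzipCodec</value>
</property>
```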
[jira] [Resolved] (HDFS-17571) TestDatanodeManager#testGetBlockLocationConsiderStorageType is flaky
[ https://issues.apache.org/jira/browse/HDFS-17571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ayush Saxena resolved HDFS-17571. - Resolution: Duplicate > TestDatanodeManager#testGetBlockLocationConsiderStorageType is flaky > > > Key: HDFS-17571 > URL: https://issues.apache.org/jira/browse/HDFS-17571 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Ayush Saxena >Priority: Major > > {noformat} > org.junit.ComparisonFailure: expected: but was: > at org.junit.Assert.assertEquals(Assert.java:117) > at org.junit.Assert.assertEquals(Assert.java:146) > at > org.apache.hadoop.hdfs.server.blockmanagement.TestDatanodeManager.testGetBlockLocationConsiderStorageType(TestDatanodeManager.java:769) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > {noformat} > Ref: > https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6906/2/testReport/org.apache.hadoop.hdfs.server.blockmanagement/TestDatanodeManager/testGetBlockLocationConsiderStorageType/ > https://ci-hadoop.apache.org/view/Hadoop/job/hadoop-qbt-trunk-java8-linux-x86_64/1628/testReport/junit/org.apache.hadoop.hdfs.server.blockmanagement/TestDatanodeManager/testGetBlockLocationConsiderStorageType/ -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
[jira] [Resolved] (HDFS-17557) Fix bug for TestRedundancyMonitor#testChooseTargetWhenAllDataNodesStop
[ https://issues.apache.org/jira/browse/HDFS-17557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ayush Saxena resolved HDFS-17557. - Fix Version/s: 3.5.0 Hadoop Flags: Reviewed Resolution: Fixed > Fix bug for TestRedundancyMonitor#testChooseTargetWhenAllDataNodesStop > -- > > Key: HDFS-17557 > URL: https://issues.apache.org/jira/browse/HDFS-17557 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Haiyang Hu >Assignee: Haiyang Hu >Priority: Major > Labels: pull-request-available > Fix For: 3.5.0 > > > Due to the modification in HDFS-16456, the current UT has not been able to > run successfully, so we need to fix it. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
[jira] [Resolved] (HDFS-17564) EC: Fix the issue of inaccurate metrics when decommission mark busy DN
[ https://issues.apache.org/jira/browse/HDFS-17564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiaoqiao He resolved HDFS-17564. Fix Version/s: 3.5.0 Hadoop Flags: Reviewed Resolution: Fixed > EC: Fix the issue of inaccurate metrics when decommission mark busy DN > -- > > Key: HDFS-17564 > URL: https://issues.apache.org/jira/browse/HDFS-17564 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Haiyang Hu >Assignee: Haiyang Hu >Priority: Major > Labels: pull-request-available > Fix For: 3.5.0 > > > If a DataNode is marked as busy and contains many EC blocks, then when > decommissioning the DataNode and executing ErasureCodingWork#addTaskToDatanode, no > replication work will be generated for ecBlocksToBeReplicated, but the > related metrics (such as DatanodeDescriptor#currApproxBlocksScheduled, > pendingReconstruction and needReconstruction) will still be updated. > *Specific code:* > BlockManager#scheduleReconstruction -> BlockManager#chooseSourceDatanodes > [2628~2650] > If the DataNode is marked as busy and contains many EC blocks, it will not be added > to srcNodes. > . > {code:java} > @VisibleForTesting > DatanodeDescriptor[] chooseSourceDatanodes(BlockInfo block, > List containingNodes, > List nodesContainingLiveReplicas, > NumberReplicas numReplicas, List liveBlockIndices, > List liveBusyBlockIndices, List excludeReconstructed, int > priority) { > containingNodes.clear(); > nodesContainingLiveReplicas.clear(); > List srcNodes = new ArrayList<>(); > ... > for (DatanodeStorageInfo storage : blocksMap.getStorages(block)) { > final DatanodeDescriptor node = getDatanodeDescriptorFromStorage(storage); > final StoredReplicaState state = checkReplicaOnStorage(numReplicas, block, > storage, corruptReplicas.getNodes(block), false); > ... 
> // for EC here need to make sure the numReplicas replicates state correct > // because in the scheduleReconstruction it need the numReplicas to check > // whether need to reconstruct the ec internal block > byte blockIndex = -1; > if (isStriped) { > blockIndex = ((BlockInfoStriped) block) > .getStorageBlockIndex(storage); > countLiveAndDecommissioningReplicas(numReplicas, state, > liveBitSet, decommissioningBitSet, blockIndex); > } > if (priority != LowRedundancyBlocks.QUEUE_HIGHEST_PRIORITY > && (!node.isDecommissionInProgress() && !node.isEnteringMaintenance()) > && node.getNumberOfBlocksToBeReplicated() + > node.getNumberOfBlocksToBeErasureCoded() >= maxReplicationStreams) { > if (isStriped && (state == StoredReplicaState.LIVE > || state == StoredReplicaState.DECOMMISSIONING)) { > liveBusyBlockIndices.add(blockIndex); > //HDFS-16566 ExcludeReconstructed won't be reconstructed. > excludeReconstructed.add(blockIndex); > } > continue; // already reached replication limit > } > if (node.getNumberOfBlocksToBeReplicated() + > node.getNumberOfBlocksToBeErasureCoded() >= > replicationStreamsHardLimit) { > if (isStriped && (state == StoredReplicaState.LIVE > || state == StoredReplicaState.DECOMMISSIONING)) { > liveBusyBlockIndices.add(blockIndex); > //HDFS-16566 ExcludeReconstructed won't be reconstructed. > excludeReconstructed.add(blockIndex); > } > continue; > } > if(isStriped || srcNodes.isEmpty()) { > srcNodes.add(node); > if (isStriped) { > liveBlockIndices.add(blockIndex); > } > continue; > } >... > {code} > ErasureCodingWork#addTaskToDatanode[149~157] > {code:java} > @Override > void addTaskToDatanode(NumberReplicas numberReplicas) { > final DatanodeStorageInfo[] targets = getTargets(); > assert targets.length > 0; > BlockInfoStriped stripedBlk = (BlockInfoStriped) getBlock(); > ... 
> } else if ((numberReplicas.decommissioning() > 0 || > numberReplicas.liveEnteringMaintenanceReplicas() > 0) && > hasAllInternalBlocks()) { > List leavingServiceSources = findLeavingServiceSources(); > // decommissioningSources.size() should be >= targets.length > // if the leavingServiceSources size is 0, here will not to > createReplicationWork > final int num = Math.min(leavingServiceSources.size(), targets.length); > for (int i = 0; i < num; i++) { &
[jira] [Created] (HDFS-17572) TestRouterSecurityManager#testDelegationTokens is flaky
Ayush Saxena created HDFS-17572: --- Summary: TestRouterSecurityManager#testDelegationTokens is flaky Key: HDFS-17572 URL: https://issues.apache.org/jira/browse/HDFS-17572 Project: Hadoop HDFS Issue Type: Bug Reporter: Ayush Saxena {noformat} Expected: (an instance of org.apache.hadoop.security.token.SecretManager$InvalidToken and exception with message a string containing "Renewal request for unknown token") but: exception with message a string containing "Renewal request for unknown token" message was "some_renewer tried to renew an expired token (token for router: HDFS_DELEGATION_TOKEN owner=router, renewer=some_renewer, realUser=, issueDate=1720114742074, maxDate=1720114742174, sequenceNumber=6, masterKeyId=37) max expiration date: 2024-07-04 17:39:02,174+ currentTime: 2024-07-04 17:39:02,233+" Stacktrace was: org.apache.hadoop.security.token.SecretManager$InvalidToken: some_renewer tried to renew an expired token (token for router: HDFS_DELEGATION_TOKEN owner=router, renewer=some_renewer, realUser=, issueDate=1720114742074, maxDate=1720114742174, sequenceNumber=6, masterKeyId=37) max expiration date: 2024-07-04 17:39:02,174+ currentTime: 2024-07-04 17:39:02,233+ at org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager.renewToken(AbstractDelegationTokenSecretManager.java:692) at org.apache.hadoop.hdfs.server.federation.router.security.RouterSecurityManager.renewDelegationToken(RouterSecurityManager.java:180) at org.apache.hadoop.hdfs.server.federation.security.TestRouterSecurityManager.testDelegationTokens(TestRouterSecurityManager.java:140) {noformat} Ref: https://ci-hadoop.apache.org/view/Hadoop/job/hadoop-qbt-trunk-java8-linux-x86_64/1628/testReport/junit/org.apache.hadoop.hdfs.server.federation.security/TestRouterSecurityManager/testDelegationTokens/ -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
[jira] [Created] (HDFS-17571) TestDatanodeManager#testGetBlockLocationConsiderStorageType is flaky
Ayush Saxena created HDFS-17571: --- Summary: TestDatanodeManager#testGetBlockLocationConsiderStorageType is flaky Key: HDFS-17571 URL: https://issues.apache.org/jira/browse/HDFS-17571 Project: Hadoop HDFS Issue Type: Bug Reporter: Ayush Saxena {noformat} org.junit.ComparisonFailure: expected: but was: at org.junit.Assert.assertEquals(Assert.java:117) at org.junit.Assert.assertEquals(Assert.java:146) at org.apache.hadoop.hdfs.server.blockmanagement.TestDatanodeManager.testGetBlockLocationConsiderStorageType(TestDatanodeManager.java:769) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) {noformat} Ref: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6906/2/testReport/org.apache.hadoop.hdfs.server.blockmanagement/TestDatanodeManager/testGetBlockLocationConsiderStorageType/ https://ci-hadoop.apache.org/view/Hadoop/job/hadoop-qbt-trunk-java8-linux-x86_64/1628/testReport/junit/org.apache.hadoop.hdfs.server.blockmanagement/TestDatanodeManager/testGetBlockLocationConsiderStorageType/ -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
[jira] [Created] (HDFS-17570) Respect Non-Default HADOOP_ROOT_LOGGER when HADOOP_DAEMON_ROOT_LOGGER is not specified in Daemon mode
wuchang created HDFS-17570: -- Summary: Respect Non-Default HADOOP_ROOT_LOGGER when HADOOP_DAEMON_ROOT_LOGGER is not specified in Daemon mode Key: HDFS-17570 URL: https://issues.apache.org/jira/browse/HDFS-17570 Project: Hadoop HDFS Issue Type: Improvement Reporter: wuchang -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
[jira] [Created] (HDFS-17569) Setup Effective Work Number when Generating Block Reconstruction Work
wuchang created HDFS-17569: -- Summary: Setup Effective Work Number when Generating Block Reconstruction Work Key: HDFS-17569 URL: https://issues.apache.org/jira/browse/HDFS-17569 Project: Hadoop HDFS Issue Type: Improvement Reporter: wuchang -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
[jira] [Created] (HDFS-17568) [Decommission]Show Info Log for Repeated Useless refreshNode Operation
wuchang created HDFS-17568: -- Summary: [Decommission]Show Info Log for Repeated Useless refreshNode Operation Key: HDFS-17568 URL: https://issues.apache.org/jira/browse/HDFS-17568 Project: Hadoop HDFS Issue Type: Improvement Reporter: wuchang [https://github.com/apache/hadoop/pull/6921]
[jira] [Created] (HDFS-17567) Return value of method RouterRpcClient#invokeSequential is not accurate
farmmamba created HDFS-17567: Summary: Return value of method RouterRpcClient#invokeSequential is not accurate Key: HDFS-17567 URL: https://issues.apache.org/jira/browse/HDFS-17567 Project: Hadoop HDFS Issue Type: Bug Components: rbf Affects Versions: 3.4.0 Reporter: farmmamba Assignee: farmmamba
[jira] [Created] (HDFS-17566) Got wrong sorted block order when StorageType is considered.
Chenyu Zheng created HDFS-17566: --- Summary: Got wrong sorted block order when StorageType is considered. Key: HDFS-17566 URL: https://issues.apache.org/jira/browse/HDFS-17566 Project: Hadoop HDFS Issue Type: Bug Reporter: Chenyu Zheng Assignee: Chenyu Zheng I found unit test failures like below: ``` [ERROR] Tests run: 17, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 9.146 s <<< FAILURE! - in org.apache.hadoop.hdfs.server.blockmanagement.TestDatanodeManager [ERROR] testGetBlockLocationConsiderStorageType(org.apache.hadoop.hdfs.server.blockmanagement.TestDatanodeManager) Time elapsed: 0.206 s <<< FAILURE! org.junit.ComparisonFailure: expected: but was: at org.junit.Assert.assertEquals(Assert.java:117) at org.junit.Assert.assertEquals(Assert.java:146) at org.apache.hadoop.hdfs.server.blockmanagement.TestDatanodeManager.testGetBlockLocationConsiderStorageType(TestDatanodeManager.java:769) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59) at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56) at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306) at org.junit.runners.BlockJUnit4ClassRunner$1.evaluate(BlockJUnit4ClassRunner.java:100) at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:366) at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:103) at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:63) at org.junit.runners.ParentRunner$4.run(ParentRunner.java:331) at 
org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:79) at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:329) at org.junit.runners.ParentRunner.access$100(ParentRunner.java:66) at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:293) at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306) at org.junit.runners.ParentRunner.run(ParentRunner.java:413) at org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365) at org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:273) at org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:238) at org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:159) at org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:384) at org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:345) at org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:126) at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:418) ``` The reason is that the comparator order introduced in HDFS-17098 is wrong.
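The ordering pitfall HDFS-17566 describes (tie-breaking comparators chained in the wrong order) can be sketched as follows. This is an illustration only; the Node record, its fields, and the hostnames are hypothetical stand-ins, not the actual DatanodeManager code:

```java
import java.util.Comparator;
import java.util.List;

public class ComparatorOrder {
    // Hypothetical stand-in for a datanode location.
    record Node(String host, boolean decommissioned, boolean slowStorage) {}

    public static void main(String[] args) {
        List<Node> nodes = List.of(
            new Node("dn1", true,  false),   // decommissioned, fast storage
            new Node("dn2", false, true));   // live, slow storage

        // Correct chain: node state is the primary sort key, storage type
        // only breaks ties (false sorts before true for Booleans).
        Comparator<Node> correct = Comparator
            .comparing(Node::decommissioned)
            .thenComparing(Node::slowStorage);

        // Wrong chain: storage type dominates, so a decommissioned node with
        // fast storage sorts ahead of a live node -- the reported bug shape.
        Comparator<Node> wrong = Comparator
            .comparing(Node::slowStorage)
            .thenComparing(Node::decommissioned);

        System.out.println(nodes.stream().sorted(correct).toList().get(0).host());
        System.out.println(nodes.stream().sorted(wrong).toList().get(0).host());
    }
}
```

With the correct chain the live node dn2 comes first; reversing the chain puts the decommissioned dn1 first.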
[jira] [Created] (HDFS-17565) EC: dfs.datanode.ec.reconstruction.threads should be configurable.
Chenyu Zheng created HDFS-17565: --- Summary: EC: dfs.datanode.ec.reconstruction.threads should be configurable. Key: HDFS-17565 URL: https://issues.apache.org/jira/browse/HDFS-17565 Project: Hadoop HDFS Issue Type: Improvement Reporter: Chenyu Zheng Assignee: Chenyu Zheng dfs.datanode.ec.reconstruction.threads should be configurable so that the speed of EC block copy can be adjusted. This is especially relevant for HDFS-17550, which wants to decommission DataNodes via EC block reconstruction.
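As a sketch of what a runtime-adjustable thread count could look like: java.util.concurrent.ThreadPoolExecutor already supports resizing after construction, which is the usual mechanism behind reconfigurable knobs. The pool below is a stand-in, not the actual EC reconstruction pool in the DataNode:

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class ReconfThreads {
    public static void main(String[] args) {
        // Stand-in for the DataNode's EC reconstruction pool, initially
        // sized by a config value (8 here is an arbitrary example).
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
            8, 8, 60, TimeUnit.SECONDS, new LinkedBlockingQueue<>());
        System.out.println(pool.getCorePoolSize());

        // Simulated reconfiguration: slow down EC block copy by shrinking
        // the pool. Shrink the core size first so max >= core always holds
        // (setMaximumPoolSize rejects a value below the core size).
        pool.setCorePoolSize(4);
        pool.setMaximumPoolSize(4);
        System.out.println(pool.getCorePoolSize());
        pool.shutdown();
    }
}
```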
[jira] [Created] (HDFS-17564) Erasure Coding: Fix the issue of inaccurate metrics when decommission mark busy DN
Haiyang Hu created HDFS-17564: - Summary: Erasure Coding: Fix the issue of inaccurate metrics when decommission mark busy DN Key: HDFS-17564 URL: https://issues.apache.org/jira/browse/HDFS-17564 Project: Hadoop HDFS Issue Type: Bug Reporter: Haiyang Hu Assignee: Haiyang Hu
[jira] [Created] (HDFS-17563) IPC's epoch is not the current writer epoch
Yonghao Zou created HDFS-17563: -- Summary: IPC's epoch is not the current writer epoch Key: HDFS-17563 URL: https://issues.apache.org/jira/browse/HDFS-17563 Project: Hadoop HDFS Issue Type: Bug Components: ipc Affects Versions: 3.3.4, 3.2.4 Reporter: Yonghao Zou I got the following errors when running a cluster: {code:java} 2024-06-28 03:07:57,334 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Remote journal 127.0.0.1:8485 failed to write txns 4-5. Will try to write to this JN again after the next log roll. org.apache.hadoop.ipc.RemoteException(java.io.IOException): IPC's epoch 1 is not the current writer epoch 0 ; journal id: mycluster at org.apache.hadoop.hdfs.qjournal.server.Journal.checkWriteRequest(Journal.java:521) at org.apache.hadoop.hdfs.qjournal.server.Journal.journal(Journal.java:398) at org.apache.hadoop.hdfs.qjournal.server.JournalNodeRpcServer.journal(JournalNodeRpcServer.java:191) at org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolServerSideTranslatorPB.journal(QJournalProtocolServerSideTranslatorPB.java:164) at org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocolProtos$QJournalProtocolService$2.callBlockingMethod(QJournalProtocolProtos.java:28974) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:549) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:518) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1086) at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1035) at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:963) at java.base/java.security.AccessController.doPrivileged(Native Method) at java.base/javax.security.auth.Subject.doAs(Subject.java:423) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1762) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2960) at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1612) at 
org.apache.hadoop.ipc.Client.call(Client.java:1558) at org.apache.hadoop.ipc.Client.call(Client.java:1455) at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:231) at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118) at com.sun.proxy.$Proxy14.journal(Unknown Source) at org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolTranslatorPB.journal(QJournalProtocolTranslatorPB.java:191) at org.apache.hadoop.hdfs.qjournal.client.IPCLoggerChannel$7.call(IPCLoggerChannel.java:401) at org.apache.hadoop.hdfs.qjournal.client.IPCLoggerChannel$7.call(IPCLoggerChannel.java:394) at com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:125) at com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:57) at com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:78) at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) at java.base/java.lang.Thread.run(Thread.java:829){code}
[jira] [Created] (HDFS-17562) NPE in ipc/Client.java
Yonghao Zou created HDFS-17562: -- Summary: NPE in ipc/Client.java Key: HDFS-17562 URL: https://issues.apache.org/jira/browse/HDFS-17562 Project: Hadoop HDFS Issue Type: Bug Components: ipc Affects Versions: 3.3.6, 3.3.4, 3.2.4 Reporter: Yonghao Zou An NPE happened today that crashed datanodes. {code:java} 2024-06-28 03:07:58,649 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: IOException in offerService java.io.IOException: DestHost:destPort localhost:9000 , LocalHost:localPort e07ff098d9e2/172.17.0.4:0. Failed on local exception: java.io.IOException: Error reading responses at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) at java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:490) at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:842) at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:817) at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1616) at org.apache.hadoop.ipc.Client.call(Client.java:1558) at org.apache.hadoop.ipc.Client.call(Client.java:1455) at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:231) at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118) at com.sun.proxy.$Proxy19.sendHeartbeat(Unknown Source) at org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.sendHeartbeat(DatanodeProtocolClientSideTranslatorPB.java:168) at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.sendHeartBeat(BPServiceActor.java:524) at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:658) at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:855) at 
java.base/java.lang.Thread.run(Thread.java:829) Caused by: java.io.IOException: Error reading responses at org.apache.hadoop.ipc.Client$Connection.run(Client.java:1141) Caused by: java.lang.NullPointerException at org.apache.hadoop.ipc.Client$Connection.receiveRpcResponse(Client.java:1252) at org.apache.hadoop.ipc.Client$Connection.run(Client.java:1134) {code}
[jira] [Created] (HDFS-17561) Make HeartbeatManager.Monitor use up-to-date heartbeatRecheckInterval
Felix N created HDFS-17561: -- Summary: Make HeartbeatManager.Monitor use up-to-date heartbeatRecheckInterval Key: HDFS-17561 URL: https://issues.apache.org/jira/browse/HDFS-17561 Project: Hadoop HDFS Issue Type: Improvement Reporter: Felix N Assignee: Felix N DatanodeManager can change heartbeatRecheckInterval via the reconf API, but HeartbeatManager's copy of heartbeatRecheckInterval is fixed at initialization and is not updated when DatanodeManager picks up a new config.
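The stale-copy problem can be illustrated with a minimal sketch (the class names are hypothetical, not the actual HeartbeatManager code): a monitor that snapshots the value at construction never sees a reconf update, while one that re-reads through a supplier does:

```java
import java.util.function.LongSupplier;

public class LiveConfig {
    // Stand-in for DatanodeManager's reconfigurable field
    // (5 minutes is the HDFS default for heartbeat recheck).
    static volatile long heartbeatRecheckIntervalMs = 5 * 60 * 1000;

    // Buggy pattern: the value is copied once at construction.
    static class CachingMonitor {
        final long intervalMs = heartbeatRecheckIntervalMs;
        long interval() { return intervalMs; }
    }

    // Fixed pattern: every use goes through a supplier, so a reconf
    // update is observed immediately.
    static class LiveMonitor {
        final LongSupplier intervalMs;
        LiveMonitor(LongSupplier s) { intervalMs = s; }
        long interval() { return intervalMs.getAsLong(); }
    }

    public static void main(String[] args) {
        CachingMonitor cached = new CachingMonitor();
        LiveMonitor live = new LiveMonitor(() -> heartbeatRecheckIntervalMs);
        heartbeatRecheckIntervalMs = 1000;     // simulate a reconf call
        System.out.println(cached.interval()); // stale snapshot
        System.out.println(live.interval());   // up to date
    }
}
```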
[jira] [Created] (HDFS-17560) When CurrentCall does not include StateId, it can still send requests to the Observer.
fuchaohong created HDFS-17560: - Summary: When CurrentCall does not include StateId, it can still send requests to the Observer. Key: HDFS-17560 URL: https://issues.apache.org/jira/browse/HDFS-17560 Project: Hadoop HDFS Issue Type: Improvement Components: rbf Reporter: fuchaohong When the size of the federated state propagated to the client exceeds maxSizeOfFederatedStateToPropagate, all requests are forwarded to the active NameNode. I don't think this is very reasonable. When Observer reads are enabled and no FederatedState has been propagated to the client, advanceClientStateId can use sharedGlobalStateId to assign poolLocalStateId and send the request to the Observer.
[jira] [Created] (HDFS-17559) Fix the uuid as null in NameNodeMXBean
Haiyang Hu created HDFS-17559: - Summary: Fix the uuid as null in NameNodeMXBean Key: HDFS-17559 URL: https://issues.apache.org/jira/browse/HDFS-17559 Project: Hadoop HDFS Issue Type: Bug Reporter: Haiyang Hu Assignee: Haiyang Hu If there is datanode info in includes, but the datanode service is not currently started, the uuid of the datanode will be null. When getting the DeadNodes metric, the following exception will occur: {code:java} 2024-06-26 17:06:49,698 ERROR jmx.JMXJsonServlet (JMXJsonServlet.java:writeAttribute(345)) [qtp1107412069-7704] - getting attribute DeadNodes of Hadoop:service=NameNode,name=NameNodeInfo threw an exception javax.management.RuntimeMBeanException: java.lang.NullPointerException: null value in entry: uuid=null at com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.rethrow(DefaultMBeanServerInterceptor.java:839) at com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.rethrowMaybeMBeanException(DefaultMBeanServerInterceptor.java:852) at com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.getAttribute(DefaultMBeanServerInterceptor.java:651) at com.sun.jmx.mbeanserver.JmxMBeanServer.getAttribute(JmxMBeanServer.java:678) at org.apache.hadoop.jmx.JMXJsonServlet.writeAttribute(JMXJsonServlet.java:338) {code}
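The failure mode is that immutable map builders reject null values. A hedged sketch, using java.util.Map.of in place of the Guava ImmutableMap that produces the "null value in entry" message above, and with a placeholder substitution as one possible remedy (not necessarily the committed fix):

```java
import java.util.Map;

public class DeadNodeInfo {
    public static void main(String[] args) {
        // A datanode listed in includes but never started has no uuid yet.
        String uuid = null;

        // Map.of, like Guava's ImmutableMap, throws NullPointerException
        // on a null value -- this is what breaks the DeadNodes attribute.
        try {
            Map.of("uuid", uuid);
        } catch (NullPointerException e) {
            System.out.println("NPE on null uuid");
        }

        // One possible guard: substitute a placeholder before building
        // the per-node info map.
        Map<String, String> info = Map.of("uuid", uuid == null ? "" : uuid);
        System.out.println(info.get("uuid").isEmpty());
    }
}
```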
[jira] [Created] (HDFS-17558) RBF: Make maxSizeOfFederatedStateToPropagate work on setResponseHeaderState.
fuchaohong created HDFS-17558: - Summary: RBF: Make maxSizeOfFederatedStateToPropagate work on setResponseHeaderState. Key: HDFS-17558 URL: https://issues.apache.org/jira/browse/HDFS-17558 Project: Hadoop HDFS Issue Type: Bug Components: rbf Reporter: fuchaohong When the size of namespaceIdMap exceeds RBFConfigKeys.DFS_ROUTER_OBSERVER_FEDERATED_STATE_PROPAGATION_MAXSIZE, the federated state does not propagate. This behavior is inconsistent with the configuration description, which states that the size of the federated state propagated to the client should be limited.
[jira] [Created] (HDFS-17557) Fix bug for TestRedundancyMonitor#testChooseTargetWhenAllDataNodesStop
Haiyang Hu created HDFS-17557: - Summary: Fix bug for TestRedundancyMonitor#testChooseTargetWhenAllDataNodesStop Key: HDFS-17557 URL: https://issues.apache.org/jira/browse/HDFS-17557 Project: Hadoop HDFS Issue Type: Bug Reporter: Haiyang Hu Assignee: Haiyang Hu
[jira] [Created] (HDFS-17556) timedOutItems should also be checked to avoid repeated adds to neededReconstruction when decommissioning
caozhiqiang created HDFS-17556: -- Summary: timedOutItems should also be checked to avoid repeated adds to neededReconstruction when decommissioning Key: HDFS-17556 URL: https://issues.apache.org/jira/browse/HDFS-17556 Project: Hadoop HDFS Issue Type: Bug Components: namenode Affects Versions: 3.5.0 Reporter: caozhiqiang Assignee: caozhiqiang In the decommission and maintenance process, before a block is added to BlockManager::neededReconstruction, it is checked whether it has already been added. The check covers whether the block is in BlockManager::neededReconstruction or in PendingReconstructionBlocks::pendingReconstructions, as in the code below. But it also needs to check whether the block is in PendingReconstructionBlocks::timedOutItems. Otherwise, DatanodeAdminDefaultMonitor will add the block to BlockManager::neededReconstruction repeatedly if the block times out in PendingReconstructionBlocks::pendingReconstructions. {code:java} if (!blockManager.neededReconstruction.contains(block) && blockManager.pendingReconstruction.getNumReplicas(block) == 0 && blockManager.isPopulatingReplQueues()) { // Process these blocks only when active NN is out of safe mode. blockManager.neededReconstruction.add(block, liveReplicas, num.readOnlyReplicas(), num.outOfServiceReplicas(), blockManager.getExpectedRedundancyNum(block)); } {code}
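The extra guard proposed above can be sketched with plain sets standing in for the real BlockManager structures. The names mirror the description, but this is a simplified simulation, not HDFS source:

```java
import java.util.HashSet;
import java.util.Set;

public class ReconstructionGuard {
    public static void main(String[] args) {
        // Stand-ins for the three BlockManager-side collections.
        Set<String> neededReconstruction = new HashSet<>();
        Set<String> pendingReconstruction = new HashSet<>();
        Set<String> timedOutItems = new HashSet<>();

        // The block timed out in pendingReconstructions, so it was moved
        // to timedOutItems and is no longer pending.
        String block = "blk_1";
        timedOutItems.add(block);

        // Original guard: only needed + pending are consulted, so the
        // monitor would re-add the timed-out block on every scan.
        boolean addOld = !neededReconstruction.contains(block)
            && !pendingReconstruction.contains(block);

        // Proposed guard: also consult timedOutItems.
        boolean addNew = addOld && !timedOutItems.contains(block);

        System.out.println(addOld);  // repeated add would happen
        System.out.println(addNew);  // repeated add suppressed
    }
}
```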
[jira] [Created] (HDFS-17555) fix NumberFormatException using letter suffix with conf dfs.blocksize
wangzhongwei created HDFS-17555: --- Summary: fix NumberFormatException using letter suffix with conf dfs.blocksize Key: HDFS-17555 URL: https://issues.apache.org/jira/browse/HDFS-17555 Project: Hadoop HDFS Issue Type: Bug Components: benchmarks Affects Versions: 3.3.6, 3.3.4, 3.3.3, 3.3.5 Reporter: wangzhongwei When using NNThroughputBenchmark, if the configuration item dfs.blocksize in hdfs-site.xml is configured with a letter suffix, such as 256m, a NumberFormatException occurs. Command: hadoop org.apache.hadoop.hdfs.server.namenode.NNThroughputBenchmark -fs hdfs://ctyunns/ -op create -threads 100 -files 25 -filesPerDir 100 -close
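The likely shape of this bug is a raw Long.parseLong on a value that Hadoop elsewhere parses with suffix-aware helpers such as Configuration#getLongBytes. The minimal stand-in parser below illustrates the difference; it is an illustration of the failure and a possible handling, not the actual benchmark fix (Hadoop's real parser also accepts k/g/t/p/e suffixes):

```java
public class BlockSizeParse {
    // Minimal stand-in for suffix-aware size parsing.
    static long parseSize(String v) {
        char last = Character.toLowerCase(v.charAt(v.length() - 1));
        long mult = switch (last) {
            case 'k' -> 1L << 10;
            case 'm' -> 1L << 20;
            case 'g' -> 1L << 30;
            default  -> 1L;
        };
        String digits = Character.isDigit(last)
            ? v : v.substring(0, v.length() - 1);
        return Long.parseLong(digits) * mult;
    }

    public static void main(String[] args) {
        // What a bare parse of the configured value does with "256m".
        try {
            Long.parseLong("256m");
        } catch (NumberFormatException e) {
            System.out.println("NumberFormatException");
        }
        // Suffix-aware parsing yields 256 MiB in bytes.
        System.out.println(parseSize("256m"));
    }
}
```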
[jira] [Resolved] (HDFS-17554) OIV: Print the storage policy name in OIV delimited output
[ https://issues.apache.org/jira/browse/HDFS-17554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hualong Zhang resolved HDFS-17554. -- Resolution: Not A Problem > OIV: Print the storage policy name in OIV delimited output > -- > > Key: HDFS-17554 > URL: https://issues.apache.org/jira/browse/HDFS-17554 > Project: Hadoop HDFS > Issue Type: Improvement > Components: tools >Affects Versions: 3.5.0 >Reporter: Hualong Zhang >Assignee: Hualong Zhang >Priority: Major > > Refer to adding the storage policy name to the OIV output instead of the > erasure coding policy.
[jira] [Resolved] (HDFS-17528) FsImageValidation: set txid when saving a new image
[ https://issues.apache.org/jira/browse/HDFS-17528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsz-wo Sze resolved HDFS-17528. --- Fix Version/s: 3.5.0 Hadoop Flags: Reviewed Resolution: Fixed The pull request is now merged. > FsImageValidation: set txid when saving a new image > --- > > Key: HDFS-17528 > URL: https://issues.apache.org/jira/browse/HDFS-17528 > Project: Hadoop HDFS > Issue Type: Bug > Components: tools >Reporter: Tsz-wo Sze >Assignee: Tsz-wo Sze >Priority: Major > Labels: pull-request-available > Fix For: 3.5.0 > > > - When the fsimage is specified as a file and the FsImageValidation tool > saves a new image (for removing inaccessible inodes), the txid is not set. > Then, the resulting image will have 0 as its txid. > - When the fsimage is specified as a directory, the txid is set. However, it > will get NPE since NameNode metrics is uninitialized (although the metrics is > not used by FsImageValidation).
[jira] [Created] (HDFS-17554) OIV: Print the storage policy name in OIV delimited output
Hualong Zhang created HDFS-17554: Summary: OIV: Print the storage policy name in OIV delimited output Key: HDFS-17554 URL: https://issues.apache.org/jira/browse/HDFS-17554 Project: Hadoop HDFS Issue Type: Improvement Components: tools Affects Versions: 3.5.0 Reporter: Hualong Zhang Assignee: Hualong Zhang Refer to adding the storage policy name to the OIV output instead of the erasure coding policy.
[jira] [Created] (HDFS-17553) DFSOutputStream.java#closeImpl should have a retry upon flushInternal failures
Zinan Zhuang created HDFS-17553: --- Summary: DFSOutputStream.java#closeImpl should have a retry upon flushInternal failures Key: HDFS-17553 URL: https://issues.apache.org/jira/browse/HDFS-17553 Project: Hadoop HDFS Issue Type: Improvement Components: dfsclient Affects Versions: 3.4.0, 3.3.1 Reporter: Zinan Zhuang [HDFS-15865|https://issues.apache.org/jira/browse/HDFS-15865] introduced an interrupt in the DFSStreamer class to interrupt the waitForAckedSeqno call when the timeout has been exceeded. This method is used in [DFSOutputStream.java#flushInternal |https://github.com/apache/hadoop/blob/branch-3.0.0/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DFSOutputStream.java#L773] , one of whose use cases is [DFSOutputStream.java#closeImpl|https://github.com/apache/hadoop/blob/branch-3.0.0/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DFSOutputStream.java#L870] to close a file. What we saw was that we were getting more interrupts during the flushInternal call when closing out a file, which was unhandled by DFSClient and got thrown to the caller. There's a known issue [HDFS-4504|https://issues.apache.org/jira/browse/HDFS-4504] that when a file fails to close on the HDFS side, the lease is leaked until the DFSClient gets recycled. In our HBase setups, DFSClients remain long-lived in each regionserver, which means these files remain undead until the regionserver gets restarted. This issue was observed during datanode decommission because it was stuck on open files caused by the above leakage. As it's good to close an HDFS file as smoothly as possible, a retry of flushInternal during closeImpl operations would be beneficial to reduce such leakages.
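The proposed retry of flushInternal during close can be sketched as a bounded retry loop. The code below is a hypothetical simplification of DFSOutputStream#closeImpl, with a fake flush that fails twice before succeeding, to show the intended behavior rather than the actual patch:

```java
import java.io.IOException;

public class CloseRetry {
    // Hypothetical flush that fails transiently, e.g. because the
    // waitForAckedSeqno wait was interrupted (HDFS-15865).
    static int attempts = 0;
    static void flushInternal() throws IOException {
        if (++attempts < 3) throw new IOException("ack wait interrupted");
    }

    // Sketch of the proposed behavior: a bounded number of retries before
    // giving up, so a transient failure does not leak the lease.
    static void closeWithRetry(int maxRetries) throws IOException {
        IOException last = null;
        for (int i = 0; i <= maxRetries; i++) {
            try {
                flushInternal();
                return; // flushed, safe to complete the close
            } catch (IOException e) {
                last = e; // remember and retry
            }
        }
        throw last; // exhausted retries, surface the failure
    }

    public static void main(String[] args) throws IOException {
        closeWithRetry(3);
        System.out.println("closed after " + attempts + " attempts");
    }
}
```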
[jira] [Resolved] (HDFS-17439) Improve NNThroughputBenchmark to allow non super user to use the tool
[ https://issues.apache.org/jira/browse/HDFS-17439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stephen O'Donnell resolved HDFS-17439. -- Fix Version/s: 3.5.0 Resolution: Fixed > Improve NNThroughputBenchmark to allow non super user to use the tool > - > > Key: HDFS-17439 > URL: https://issues.apache.org/jira/browse/HDFS-17439 > Project: Hadoop HDFS > Issue Type: Improvement > Components: benchmarks, namenode >Reporter: Fateh Singh >Priority: Major > Labels: pull-request-available > Fix For: 3.5.0 > > > The NNThroughputBenchmark can only be used with the hdfs user or any user with > super user privileges since entering/exiting safemode is a privileged > operation. However, when using a super user, ACL checks are skipped. Hence it > renders the tool useless when testing namenode performance along with > authorization frameworks such as Apache Ranger / any other authorization > frameworks. > An optional argument such as -nonSuperUser can be used to skip the statements > such as entering / exiting safemode. This optional argument makes the tool > useful for incorporating authorization frameworks into the performance > estimation flows.
[jira] [Created] (HDFS-17552) [ARR] RPC client uses CompletableFuture to support asynchronous operations.
Jian Zhang created HDFS-17552: - Summary: [ARR] RPC client uses CompletableFuture to support asynchronous operations. Key: HDFS-17552 URL: https://issues.apache.org/jira/browse/HDFS-17552 Project: Hadoop HDFS Issue Type: Sub-task Reporter: Jian Zhang h3. Description The asynchronous RPC client implementation mainly builds on HADOOP-13226, HDFS-10224, etc. However, the existing implementation does not support `CompletableFuture`; instead, it relies on setting up callbacks, which can lead to the "callback hell" problem. Using `CompletableFuture` organizes asynchronous callbacks better. Therefore, on top of the existing implementation, by using `CompletableFuture`, once the `client.call` completes, an asynchronous thread handles the response of the call without blocking the main thread. *Test* new UT ** TestAsyncIPC#testAsyncCallWithCompletableFuture()
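The contrast the issue draws, callback registration versus CompletableFuture composition, can be sketched as follows. The "call" here is a stub supplier, not a real ipc.Client invocation; the point is that dependent stages and error handling chain flatly instead of nesting:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class AsyncCall {
    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();

        // Stub for an async client.call: the response is produced on a
        // pool thread, not the caller's thread.
        CompletableFuture<String> call =
            CompletableFuture.supplyAsync(() -> "response", pool);

        // Follow-up work composes as flat stages; with raw callbacks each
        // dependent step would nest one level deeper.
        String result = call
            .thenApply(String::toUpperCase)   // runs when the call completes
            .exceptionally(t -> "fallback")   // error path, no nesting
            .get(5, TimeUnit.SECONDS);        // only the final consumer blocks

        System.out.println(result);
        pool.shutdown();
    }
}
```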
[jira] [Resolved] (HDFS-17551) Fix unit test failure caused by HDFS-17464
[ https://issues.apache.org/jira/browse/HDFS-17551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ayush Saxena resolved HDFS-17551. - Fix Version/s: 3.5.0 Hadoop Flags: Reviewed Resolution: Fixed > Fix unit test failure caused by HDFS-17464 > -- > > Key: HDFS-17551 > URL: https://issues.apache.org/jira/browse/HDFS-17551 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: farmmamba >Assignee: farmmamba >Priority: Minor > Labels: pull-request-available > Fix For: 3.5.0 > > > As title. > This Jira is used to fix unit test failure caused by HDFS-17464.
[jira] [Created] (HDFS-17551) Fix unit test failure caused by HDFS-17464
farmmamba created HDFS-17551: Summary: Fix unit test failure caused by HDFS-17464 Key: HDFS-17551 URL: https://issues.apache.org/jira/browse/HDFS-17551 Project: Hadoop HDFS Issue Type: Bug Reporter: farmmamba Assignee: farmmamba As title. This Jira is used to fix unit test failure caused by HDFS-17464.
[jira] [Resolved] (HDFS-17539) TestFileChecksum should not spin up a MiniDFSCluster for every test
[ https://issues.apache.org/jira/browse/HDFS-17539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ZanderXu resolved HDFS-17539. - Fix Version/s: 3.5.0 Hadoop Flags: Reviewed Resolution: Fixed > TestFileChecksum should not spin up a MiniDFSCluster for every test > --- > > Key: HDFS-17539 > URL: https://issues.apache.org/jira/browse/HDFS-17539 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: Felix N >Assignee: Felix N >Priority: Minor > Labels: pull-request-available > Fix For: 3.5.0 > > > TestFileChecksum has 34 tests. Add its brother the parameterized > COMPOSITE_CRC version and that's 68 times a cluster is spun up then shutdown > when twice is necessary (or maybe even once but 2 is not too bad).
[jira] [Created] (HDFS-17549) SecretManager should not hardcode HMAC algorithm
Tsz-wo Sze created HDFS-17549: - Summary: SecretManager should not hardcode HMAC algorithm Key: HDFS-17549 URL: https://issues.apache.org/jira/browse/HDFS-17549 Project: Hadoop HDFS Issue Type: Improvement Components: security Reporter: Tsz-wo Sze Assignee: Tsz-wo Sze
[jira] [Created] (HDFS-17548) excessive NO_REQUIRED_STORAGE_TYPE messages
Szymon Orzechowski created HDFS-17548: - Summary: excessive NO_REQUIRED_STORAGE_TYPE messages Key: HDFS-17548 URL: https://issues.apache.org/jira/browse/HDFS-17548 Project: Hadoop HDFS Issue Type: Bug Components: namenode Affects Versions: 3.3.4 Reporter: Szymon Orzechowski Notification of unavailable storageType has been implemented in HDFS-15815. Yesterday we noted a failure on our production cluster. As a side result of analyzing the reasons for the failure, we found additional error messages: nn-3.wphadoop.dc-2.jumbo._hadoop-hdfs-namenode.log.out:2024-06-07 00:35:23,381 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy: Not enough replicas was chosen. Reason: \{NO_REQUIRED_STORAGE_TYPE=1} These tell us very little and seem to make absolutely no sense in the case of our cluster (12 racks, no storage policies enabled nor storage types defined). However, in 100% of cases they occur directly (or almost directly) after messages like: nn-3.wphadoop.dc-2.jumbo._hadoop-hdfs-namenode.log.out-2024-06-07 00:35:23,380 INFO org.apache.hadoop.ipc.Server: IPC Server handler 25 on default port 8020, call#9866 Retry#0 org.apache.hadoop.hdfs.protocol.ClientProtocol. create from 10.32.20.25:35130: org.apache.hadoop.hdfs.protocol.FSLimitException$MaxDirectoryItemsExceededException: The directory item limit of /user/gobblin/loghost/failures/dot_ma/undefined is exceeded: limit=1048576 items=1048576 Which leads me to the conclusion that in this case the NO_REQUIRED_STORAGE_TYPE errors are raised due to reaching the limit specified in property dfs.namenode.fs-limits.max-directory-items. Perhaps they should be restricted as they provide no information and actually report a non-existent problem. Additionally, immediately after clearing the /user/gobblin/loghost/failures/dot_ma/undefined directory, the NO_REQUIRED_STORAGE_TYPE messages stopped appearing. 
--- I would also like to take this opportunity to ask where to find a list specifying the meaning of the values used in the NO_REQUIRED_STORAGE_TYPE=1 messages (in this case, 1).
[jira] [Created] (HDFS-17547) debug verifyEC check error
ruiliang created HDFS-17547: --- Summary: debug verifyEC check error Key: HDFS-17547 URL: https://issues.apache.org/jira/browse/HDFS-17547 Project: Hadoop HDFS Issue Type: Improvement Components: hdfs-common Reporter: ruiliang When I validate a block that has been corrupted many times, does it appear normal? {code:java} hdfs debug verifyEC -file /file.orc 24/06/07 15:40:29 WARN erasurecode.ErasureCodeNative: ISA-L support is not available in your platform... using builtin-java codec where applicable Checking EC block group: blk_-9223372036492703744 Status: OK {code} ByteBuffer hb show [0..] {code:java} buffers = {ByteBuffer[5]@3270} 0 = {HeapByteBuffer@3430} "java.nio.HeapByteBuffer[pos=65536 lim=65536 cap=65536]" 1 = {HeapByteBuffer@3434} "java.nio.HeapByteBuffer[pos=65536 lim=65536 cap=65536]" 2 = {HeapByteBuffer@3438} "java.nio.HeapByteBuffer[pos=65536 lim=65536 cap=65536]" 3 = {HeapByteBuffer@3504} "java.nio.HeapByteBuffer[pos=0 lim=65536 cap=65536]" hb = {byte[65536]@3511} [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, +65,436 more] buffers[this.dataBlkNum + ixx].equals(outputs[ixx] =true ? outputs = {ByteBuffer[2]@3271} 0 = {HeapByteBuffer@3455} "java.nio.HeapByteBuffer[pos=0 lim=65536 cap=65536]" hb = {byte[65536]@3459} [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, +65,436 more]{code} Can this situation be judged as an anomaly? 
check orc file {code:java} Structure for skip_ip/_skip_file File Version: 0.12 with ORC_517 by ORC Java Exception in thread "main" java.io.IOException: Problem opening stripe 0 footer in skip_ip/_skip_file. at org.apache.orc.impl.RecordReaderImpl.(RecordReaderImpl.java:360) at org.apache.orc.impl.ReaderImpl.rows(ReaderImpl.java:879) at org.apache.orc.impl.ReaderImpl.rows(ReaderImpl.java:873) at org.apache.orc.tools.FileDump.printMetaDataImpl(FileDump.java:345) at org.apache.orc.tools.FileDump.printMetaData(FileDump.java:276) at org.apache.orc.tools.FileDump.main(FileDump.java:137) at org.apache.orc.tools.Driver.main(Driver.java:124) Caused by: java.lang.IllegalArgumentException: Buffer size too small. size = 131072 needed = 7752508 in column 3 kind LENGTH at org.apache.orc.impl.InStream$CompressedStream.readHeader(InStream.java:481) at org.apache.orc.impl.InStream$CompressedStream.ensureUncompressed(InStream.java:528) at org.apache.orc.impl.InStream$CompressedStream.read(InStream.java:507) at org.apache.orc.impl.RunLengthIntegerReaderV2.readValues(RunLengthIntegerReaderV2.java:59) at org.apache.orc.impl.RunLengthIntegerReaderV2.next(RunLengthIntegerReaderV2.java:333) at org.apache.orc.impl.TreeReaderFactory$StringDictionaryTreeReader.readDictionaryLengthStream(TreeReaderFactory.java:2221) at org.apache.orc.impl.TreeReaderFactory$StringDictionaryTreeReader.startStripe(TreeReaderFactory.java:2201) at org.apache.orc.impl.TreeReaderFactory$StringTreeReader.startStripe(TreeReaderFactory.java:1943) at org.apache.orc.impl.reader.tree.StructBatchReader.startStripe(StructBatchReader.java:112) at org.apache.orc.impl.RecordReaderImpl.readStripe(RecordReaderImpl.java:1251) at org.apache.orc.impl.RecordReaderImpl.advanceStripe(RecordReaderImpl.java:1290) at org.apache.orc.impl.RecordReaderImpl.advanceToNextRow(RecordReaderImpl.java:1333) at org.apache.orc.impl.RecordReaderImpl.(RecordReaderImpl.java:355) ... 
6 more {code}
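On the verifyEC question above: note that data cells that are all zeros produce all-zero parity, so a byte-for-byte comparison of recomputed parity against stored parity (as in the debugger dump, where `buffers[dataBlkNum + i].equals(outputs[i])` is true for all-zero buffers) can report Status: OK even when the original file content has been lost. The sketch below illustrates that comparison idea only; it uses plain XOR parity to stay self-contained, whereas HDFS EC actually uses Reed-Solomon coding, and the class and method names are illustrative, not the verifyEC internals.

```java
import java.util.Arrays;

// Illustrative parity check: recompute a parity buffer from the data
// buffers and compare byte-for-byte with the stored parity.
class ParityCheck {
  static byte[] xorParity(byte[][] dataBlocks) {
    byte[] parity = new byte[dataBlocks[0].length];
    for (byte[] block : dataBlocks) {
      for (int i = 0; i < parity.length; i++) {
        parity[i] ^= block[i];  // XOR stands in for Reed-Solomon here
      }
    }
    return parity;
  }

  /** True when the stored parity matches the recomputed one. */
  static boolean verify(byte[][] dataBlocks, byte[] storedParity) {
    return Arrays.equals(xorParity(dataBlocks), storedParity);
  }
}
```

A consequence of this design is that a block group zeroed out by corruption still "verifies": zero data yields zero parity, which matches stored zero parity. That is consistent with what the reporter observed.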
[jira] [Created] (HDFS-17546) Implementing Timeout for HostFileReader when FS hangs
Simbarashe Dzinamarira created HDFS-17546: - Summary: Implementing Timeout for HostFileReader when FS hangs Key: HDFS-17546 URL: https://issues.apache.org/jira/browse/HDFS-17546 Project: Hadoop HDFS Issue Type: Improvement Components: namenode Reporter: Simbarashe Dzinamarira Assignee: Simbarashe Dzinamarira

Certain deployments of Hadoop have the dfs.hosts file residing on NAS/NFS, potentially behind symlinks. If the FS hangs for any reason, the refreshNodes call hangs indefinitely in the HostsFileReader until the FS returns.
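One common way to bound a blocking file read, sketched below under stated assumptions: the class and method names are illustrative, not the actual HostsFileReader API, and the real fix would need to integrate with Hadoop's existing reader rather than replace it.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Sketch: run the blocking read on a daemon thread and fail fast if a
// hung NFS mount keeps it from returning.
class TimedHostsFileReader {
  private static final ExecutorService READER =
      Executors.newSingleThreadExecutor(r -> {
        Thread t = new Thread(r, "hosts-file-reader");
        t.setDaemon(true);  // a stuck read must not block JVM shutdown
        return t;
      });

  /** Read all lines of the hosts file, failing after timeoutMs. */
  static List<String> readHostsFile(Path file, long timeoutMs)
      throws IOException {
    Future<List<String>> task = READER.submit(() -> Files.readAllLines(file));
    try {
      return task.get(timeoutMs, TimeUnit.MILLISECONDS);
    } catch (TimeoutException e) {
      task.cancel(true);
      throw new IOException("Timed out reading " + file, e);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IOException("Interrupted reading " + file, e);
    } catch (ExecutionException e) {
      throw new IOException("Failed to read " + file, e.getCause());
    }
  }
}
```

With this shape, refreshNodes would surface an IOException after the timeout instead of hanging forever.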
[jira] [Created] (HDFS-17545) [ARR] router async rpc client.
Jian Zhang created HDFS-17545: - Summary: [ARR] router async rpc client. Key: HDFS-17545 URL: https://issues.apache.org/jira/browse/HDFS-17545 Project: Hadoop HDFS Issue Type: Sub-task Reporter: Jian Zhang
[jira] [Created] (HDFS-17544) [ARR] The router client rpc protocol supports asynchrony.
Jian Zhang created HDFS-17544: - Summary: [ARR] The router client rpc protocol supports asynchrony. Key: HDFS-17544 URL: https://issues.apache.org/jira/browse/HDFS-17544 Project: Hadoop HDFS Issue Type: Sub-task Reporter: Jian Zhang
[jira] [Created] (HDFS-17542) EC: Optimize the EC block reconstruction.
Chenyu Zheng created HDFS-17542: --- Summary: EC: Optimize the EC block reconstruction. Key: HDFS-17542 URL: https://issues.apache.org/jira/browse/HDFS-17542 Project: Hadoop HDFS Issue Type: Improvement Reporter: Chenyu Zheng Assignee: Chenyu Zheng

The current reconstruction process for EC blocks is based on that of the original contiguous blocks. It is mainly implemented through the work constructed by computeReconstructionWorkForBlocks and can be roughly divided into three steps:
* (1) scheduleReconstruction: select srcNodes as the source of the copy according to the status of each replica of the block.
* (2) chooseTargets: select the targets of the copy.
* (3) validateReconstructionWork: add the copy command to the srcNode, which receives the command through a heartbeat and executes the block copy from src to target.

For ordinary contiguous blocks this is the whole story. For EC blocks, (1) and (2) are nearly the same, but in (3) either block copying or block reconstruction may occur, or no work may be generated at all, for example when some storages are busy. If no work is generated, that leads to the problem described in HDFS-17516. Even when no block copying or reconstruction is generated, pendingReconstruction and neededReconstruction are still updated until the block times out, which wastes the scheduling opportunity. In order to stay compatible with the original contiguous blocks while deciding the specific action in (3), the otherwise unnecessary liveBlockIndices, liveBusyBlockIndices, and excludeReconstructedIndices are introduced. We know many bugs are related to this code; they can be avoided.

Improvement: move the decision of whether to copy or reconstruct blocks from (3) to (1). This is also more conducive to implementing the explicit specification of the reconstruction block index mentioned in HDFS-16874, and removes the need to pass liveBlockIndices and liveBusyBlockIndices.
[jira] [Created] (HDFS-17541) Support msync requests to a separate RPC server for the active NameNode
Liangjun He created HDFS-17541: -- Summary: Support msync requests to a separate RPC server for the active NameNode Key: HDFS-17541 URL: https://issues.apache.org/jira/browse/HDFS-17541 Project: Hadoop HDFS Issue Type: Improvement Reporter: Liangjun He Assignee: Liangjun He
[jira] [Resolved] (HDFS-17533) RBF: Unit tests that use embedded SQL failing in CI
[ https://issues.apache.org/jira/browse/HDFS-17533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Simbarashe Dzinamarira resolved HDFS-17533. --- Resolution: Fixed > RBF: Unit tests that use embedded SQL failing in CI > --- > > Key: HDFS-17533 > URL: https://issues.apache.org/jira/browse/HDFS-17533 > Project: Hadoop HDFS > Issue Type: Test >Reporter: Simbarashe Dzinamarira >Assignee: Simbarashe Dzinamarira >Priority: Major > > In the CI runs for RBF the following two tests are failing > {noformat} > [ERROR] Failures: > [ERROR] > org.apache.hadoop.hdfs.server.federation.router.security.token.TestSQLDelegationTokenSecretManagerImpl.null > [ERROR] Run 1: TestSQLDelegationTokenSecretManagerImpl Multiple Failures (2 > failures) > java.sql.SQLException: No suitable driver found for > jdbc:derby:memory:TokenStore;create=true > java.lang.RuntimeException: java.sql.SQLException: No suitable driver > found for jdbc:derby:memory:TokenStore;drop=true > [ERROR] Run 2: TestSQLDelegationTokenSecretManagerImpl Multiple Failures (2 > failures) > java.sql.SQLException: No suitable driver found for > jdbc:derby:memory:TokenStore;create=true > java.lang.RuntimeException: java.sql.SQLException: No suitable driver > found for jdbc:derby:memory:TokenStore;drop=true > [ERROR] Run 3: TestSQLDelegationTokenSecretManagerImpl Multiple Failures (2 > failures) > java.sql.SQLException: No suitable driver found for > jdbc:derby:memory:TokenStore;create=true > java.lang.RuntimeException: java.sql.SQLException: No suitable driver > found for jdbc:derby:memory:TokenStore;drop=true > [INFO] > [ERROR] > org.apache.hadoop.hdfs.server.federation.store.driver.TestStateStoreMySQL.null > [ERROR] Run 1: TestStateStoreMySQL Multiple Failures (2 failures) > java.sql.SQLException: No suitable driver found for > jdbc:derby:memory:StateStore;create=true > java.lang.RuntimeException: java.sql.SQLException: No suitable driver > found for jdbc:derby:memory:StateStore;drop=true > [ERROR] Run 2: 
TestStateStoreMySQL Multiple Failures (2 failures) > java.sql.SQLException: No suitable driver found for > jdbc:derby:memory:StateStore;create=true > java.lang.RuntimeException: java.sql.SQLException: No suitable driver > found for jdbc:derby:memory:StateStore;drop=true > [ERROR] Run 3: TestStateStoreMySQL Multiple Failures (2 failures) > java.sql.SQLException: No suitable driver found for > jdbc:derby:memory:StateStore;create=true > java.lang.RuntimeException: java.sql.SQLException: No suitable driver > found for jdbc:derby:memory:StateStore;drop=true {noformat} > [https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6804/5/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs-rbf.txt] > > I believe the fix is first registering the driver: > [https://dev.mysql.com/doc/connector-j/en/connector-j-usagenotes-connect-drivermanager.html] > [https://stackoverflow.com/questions/22384710/java-sql-sqlexception-no-suitable-driver-found-for-jdbcmysql-localhost3306] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
[jira] [Resolved] (HDFS-17538) Add tranfer priority queue for decommissioning datanode
[ https://issues.apache.org/jira/browse/HDFS-17538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuanbo Liu resolved HDFS-17538. --- Resolution: Duplicate > Add tranfer priority queue for decommissioning datanode > --- > > Key: HDFS-17538 > URL: https://issues.apache.org/jira/browse/HDFS-17538 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: Yuanbo Liu >Priority: Major > Attachments: image-2024-05-29-16-24-45-601.png, > image-2024-05-29-16-26-58-359.png, image-2024-05-29-16-27-35-886.png > > > When decommissioning datanode, blocks will be checked one by one disk, then > blocks will be sent to trigger tranfer works in DN. This will make one disk > of decommissioning dn very busy and cpus stuck in io-wait with high loads, > and sometime even lead to OOM as below: > !image-2024-05-29-16-24-45-601.png|width=909,height=170! > !image-2024-05-29-16-26-58-359.png|width=909,height=228! > !image-2024-05-29-16-27-35-886.png|width=930,height=218! > Proposal to add priority queue for transfering blocks when decommisioning > datanode. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
[jira] [Created] (HDFS-17540) Namenode retry to warmup EDEK cache forever
Yu Zhang created HDFS-17540: --- Summary: Namenode retry to warmup EDEK cache forever Key: HDFS-17540 URL: https://issues.apache.org/jira/browse/HDFS-17540 Project: Hadoop HDFS Issue Type: Bug Components: encryption, namenode Affects Versions: 2.8.0 Reporter: Yu Zhang

https://issues.apache.org/jira/browse/HDFS-9405 adds a background thread to pre-warm the EDEK cache. However, this fails and retries continuously if key retrieval fails for even one encryption zone. In our use case, we have temporarily removed keys for certain encryption zones. Currently the namenode and KMS logs are filled with errors from the background thread retrying the warmup forever. The pre-warm thread should:
* continue to refresh other encryption zones even if it fails for one;
* retry only if it fails for all encryption zones, which will be the case when the KMS is down.
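The proposed behavior can be sketched as a single warm-up pass that tolerates per-zone failures and only signals a global retry when every zone fails. The class name, zone paths, and the fetcher callback below are illustrative stand-ins, not the NameNode's real warm-up API.

```java
import java.util.List;
import java.util.function.Function;

// Sketch of the proposed warm-up policy: one zone's missing key must not
// abort, or endlessly retry, the whole pass.
class EdekCacheWarmer {
  /**
   * Attempts one warm-up pass. Returns true if at least one zone's key was
   * fetched (no global retry needed); returns false only when every zone
   * failed, e.g. when the KMS itself is down.
   */
  static boolean warmUpOnce(List<String> zones,
                            Function<String, byte[]> keyFetcher) {
    int succeeded = 0;
    for (String zone : zones) {
      try {
        keyFetcher.apply(zone);  // pre-warm this zone's EDEK cache
        succeeded++;
      } catch (RuntimeException e) {
        // Key missing or unreachable for this zone only: log and move on.
        System.err.println("EDEK warm-up failed for zone " + zone + ": " + e);
      }
    }
    return succeeded > 0;
  }
}
```

The caller would schedule a retry only when `warmUpOnce` returns false, which matches the second bullet above.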
[jira] [Created] (HDFS-17539) TestFileChecksum should not spin up a MiniDFSCluster for every test
Felix N created HDFS-17539: -- Summary: TestFileChecksum should not spin up a MiniDFSCluster for every test Key: HDFS-17539 URL: https://issues.apache.org/jira/browse/HDFS-17539 Project: Hadoop HDFS Issue Type: Improvement Reporter: Felix N Assignee: Felix N

TestFileChecksum has 34 tests. Add its sibling, the parameterized COMPOSITE_CRC version, and a cluster is spun up and then shut down 68 times when twice would be enough (or maybe even once, but two is not too bad).
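The usual fix is to hoist the expensive fixture into class-level setup (in JUnit 4, @BeforeClass/@AfterClass) so it is started once per test class rather than once per test. The sketch below demonstrates the sharing pattern with a stand-in MiniCluster class; it is not the actual MiniDFSCluster API, and the startup counter exists only to make the saving observable.

```java
// Sketch: share one expensive fixture across all tests in a class.
class SharedClusterFixture {
  static int startups = 0;  // observable cost of cluster creation

  static class MiniCluster {            // stand-in for MiniDFSCluster
    MiniCluster() { startups++; }       // expensive: forks NN/DN threads
    void shutdown() { /* release ports, temp dirs */ }
  }

  private static MiniCluster cluster;

  /** Lazily start the cluster once for the whole test class. */
  static synchronized MiniCluster getCluster() {
    if (cluster == null) {
      cluster = new MiniCluster();
    }
    return cluster;
  }

  /** Called once after all tests, e.g. from an @AfterClass method. */
  static synchronized void tearDownClass() {
    if (cluster != null) {
      cluster.shutdown();
      cluster = null;
    }
  }
}
```

Tests that mutate cluster state would still need per-test cleanup of files they create, which is usually far cheaper than a full cluster restart.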
[jira] [Created] (HDFS-17538) Add tranfer priority queue for decommissioning datanode
Yuanbo Liu created HDFS-17538: - Summary: Add tranfer priority queue for decommissioning datanode Key: HDFS-17538 URL: https://issues.apache.org/jira/browse/HDFS-17538 Project: Hadoop HDFS Issue Type: Improvement Reporter: Yuanbo Liu Attachments: image-2024-05-29-16-24-45-601.png, image-2024-05-29-16-26-58-359.png, image-2024-05-29-16-27-35-886.png

When decommissioning a datanode, blocks are checked disk by disk, then sent to trigger transfer work in the DN. This makes one disk of the decommissioning DN very busy, with CPUs stuck in io-wait under high load, and sometimes even leads to OOM as below:

!image-2024-05-29-16-24-45-601.png! !image-2024-05-29-16-26-58-359.png! !image-2024-05-29-16-27-35-886.png!

Proposal: add a priority queue for transferring blocks when decommissioning a datanode.
[jira] [Resolved] (HDFS-17532) RBF: Allow router state store cache update to overwrite and delete in parallel
[ https://issues.apache.org/jira/browse/HDFS-17532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ZanderXu resolved HDFS-17532. - Hadoop Flags: Reviewed Resolution: Fixed > RBF: Allow router state store cache update to overwrite and delete in parallel > -- > > Key: HDFS-17532 > URL: https://issues.apache.org/jira/browse/HDFS-17532 > Project: Hadoop HDFS > Issue Type: Improvement > Components: hdfs, rbf >Reporter: Felix N >Assignee: Felix N >Priority: Minor > Labels: pull-request-available > > Current implementation for router state store update is quite inefficient, so > much that when routers are removed and a lot of NameNodeMembership records > are deleted in a short burst, the deletions triggered a router safemode in > our cluster and caused a lot of troubles. > This ticket aims to allow the overwrite part and delete part of > org.apache.hadoop.hdfs.server.federation.store.CachedRecordStore#overrideExpiredRecords > to run in parallel. > See HDFS-17529 for the other half of this improvement. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
[jira] [Created] (HDFS-17537) RBF : Last block report is incorrect in federationhealth.html
Ananya Singh created HDFS-17537: --- Summary: RBF : Last block report is incorrect in federationhealth.html Key: HDFS-17537 URL: https://issues.apache.org/jira/browse/HDFS-17537 Project: Hadoop HDFS Issue Type: Improvement Components: hdfs, rbf Affects Versions: 3.3.6 Reporter: Ananya Singh Assignee: Ananya Singh
[jira] [Created] (HDFS-17536) RBF: Format safe-mode related logic and fix a race
ZanderXu created HDFS-17536: --- Summary: RBF: Format safe-mode related logic and fix a race Key: HDFS-17536 URL: https://issues.apache.org/jira/browse/HDFS-17536 Project: Hadoop HDFS Issue Type: Task Reporter: ZanderXu Assignee: ZanderXu RBF: Format safe-mode related logic and fix a race.
[jira] [Created] (HDFS-17535) I have confirmed the EC corrupt file, can this corrupt file be restored?
ruiliang created HDFS-17535: --- Summary: I have confirmed the EC corrupt file, can this corrupt file be restored? Key: HDFS-17535 URL: https://issues.apache.org/jira/browse/HDFS-17535 Project: Hadoop HDFS Issue Type: Bug Components: ec, hdfs Affects Versions: 3.1.0 Reporter: ruiliang

I learned that EC does have a major bug with file corruption: https://issues.apache.org/jira/browse/HDFS-15759

1: I have confirmed the EC corrupt file; can this corrupt file be restored? It holds important data, and this is causing us production data loss issues. Is there a way to recover the corrupt file: corrupt block groups \{blk_-xx}, zeroParityBlockGroups \{blk_-xx[blk_-xx]}?

2: https://github.com/apache/orc/issues/1939 I was wondering, if I cherry-pick your current code (GitHub pull request #2869), can I skip the patches related to HDFS-14768, HDFS-15186, and HDFS-15240? hdfs version 3.1.0. Thank you.
[jira] [Resolved] (HDFS-17529) RBF: Improve router state store cache entry deletion
[ https://issues.apache.org/jira/browse/HDFS-17529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ZanderXu resolved HDFS-17529. - Fix Version/s: 3.5.0 Hadoop Flags: Reviewed Resolution: Fixed > RBF: Improve router state store cache entry deletion > > > Key: HDFS-17529 > URL: https://issues.apache.org/jira/browse/HDFS-17529 > Project: Hadoop HDFS > Issue Type: Improvement > Components: hdfs, rbf >Reporter: Felix N >Assignee: Felix N >Priority: Major > Labels: pull-request-available > Fix For: 3.5.0 > > > Current implementation for router state store update is quite inefficient, so > much that when routers are removed and a lot of NameNodeMembership records > are deleted in a short burst, the deletions triggered a router safemode in > our cluster and caused a lot of troubles. > This ticket aims to improve the deletion process for ZK state store > implementation. > See HDFS-17532 for the other half of this improvement -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
[jira] [Resolved] (HDFS-17530) Aynchronous router
[ https://issues.apache.org/jira/browse/HDFS-17530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Íñigo Goiri resolved HDFS-17530. Resolution: Duplicate > Aynchronous router > -- > > Key: HDFS-17530 > URL: https://issues.apache.org/jira/browse/HDFS-17530 > Project: Hadoop HDFS > Issue Type: New Feature >Reporter: Jian Zhang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
[jira] [Resolved] (HDFS-17464) Improve some logs output in class FsDatasetImpl
[ https://issues.apache.org/jira/browse/HDFS-17464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haiyang Hu resolved HDFS-17464. --- Fix Version/s: 3.5.0 Resolution: Resolved > Improve some logs output in class FsDatasetImpl > --- > > Key: HDFS-17464 > URL: https://issues.apache.org/jira/browse/HDFS-17464 > Project: Hadoop HDFS > Issue Type: Improvement > Components: datanode >Affects Versions: 3.4.0 >Reporter: farmmamba >Assignee: farmmamba >Priority: Minor > Labels: pull-request-available > Fix For: 3.5.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
[jira] [Created] (HDFS-17533) RBF Tests that use embedded SQL failing unit tests
Simbarashe Dzinamarira created HDFS-17533: - Summary: RBF Tests that use embedded SQL failing unit tests Key: HDFS-17533 URL: https://issues.apache.org/jira/browse/HDFS-17533 Project: Hadoop HDFS Issue Type: Test Reporter: Simbarashe Dzinamarira In the CI runs for RBF the following two tests are failing {noformat} [ERROR] Failures: [ERROR] org.apache.hadoop.hdfs.server.federation.router.security.token.TestSQLDelegationTokenSecretManagerImpl.null [ERROR] Run 1: TestSQLDelegationTokenSecretManagerImpl Multiple Failures (2 failures) java.sql.SQLException: No suitable driver found for jdbc:derby:memory:TokenStore;create=true java.lang.RuntimeException: java.sql.SQLException: No suitable driver found for jdbc:derby:memory:TokenStore;drop=true [ERROR] Run 2: TestSQLDelegationTokenSecretManagerImpl Multiple Failures (2 failures) java.sql.SQLException: No suitable driver found for jdbc:derby:memory:TokenStore;create=true java.lang.RuntimeException: java.sql.SQLException: No suitable driver found for jdbc:derby:memory:TokenStore;drop=true [ERROR] Run 3: TestSQLDelegationTokenSecretManagerImpl Multiple Failures (2 failures) java.sql.SQLException: No suitable driver found for jdbc:derby:memory:TokenStore;create=true java.lang.RuntimeException: java.sql.SQLException: No suitable driver found for jdbc:derby:memory:TokenStore;drop=true [INFO] [ERROR] org.apache.hadoop.hdfs.server.federation.store.driver.TestStateStoreMySQL.null [ERROR] Run 1: TestStateStoreMySQL Multiple Failures (2 failures) java.sql.SQLException: No suitable driver found for jdbc:derby:memory:StateStore;create=true java.lang.RuntimeException: java.sql.SQLException: No suitable driver found for jdbc:derby:memory:StateStore;drop=true [ERROR] Run 2: TestStateStoreMySQL Multiple Failures (2 failures) java.sql.SQLException: No suitable driver found for jdbc:derby:memory:StateStore;create=true java.lang.RuntimeException: java.sql.SQLException: No suitable driver found for jdbc:derby:memory:StateStore;drop=true 
[ERROR] Run 3: TestStateStoreMySQL Multiple Failures (2 failures) java.sql.SQLException: No suitable driver found for jdbc:derby:memory:StateStore;create=true java.lang.RuntimeException: java.sql.SQLException: No suitable driver found for jdbc:derby:memory:StateStore;drop=true {noformat} [https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6804/5/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs-rbf.txt] I believe the fix is first registering the driver: [https://dev.mysql.com/doc/connector-j/en/connector-j-usagenotes-connect-drivermanager.html] [https://stackoverflow.com/questions/22384710/java-sql-sqlexception-no-suitable-driver-found-for-jdbcmysql-localhost3306] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
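The registration fix suggested above can be sketched as follows. The Derby URL is taken from the error messages; the helper class and method names are illustrative. Loading the driver class with Class.forName runs its static initializer, which by JDBC convention calls DriverManager.registerDriver, so the subsequent getConnection lookup can find it even when the service-loader mechanism misses it.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

// Sketch: explicitly load the JDBC driver before the first connection
// attempt, the likely fix for "No suitable driver found".
class DriverRegistration {
  static Connection connect(String driverClass, String url)
      throws SQLException {
    try {
      // Class loading triggers the driver's static self-registration.
      Class.forName(driverClass);
    } catch (ClassNotFoundException e) {
      throw new SQLException("JDBC driver not on classpath: " + driverClass, e);
    }
    return DriverManager.getConnection(url);
  }
}
```

In the tests, this would mean loading the Derby embedded driver class (org.apache.derby.jdbc.EmbeddedDriver) before opening jdbc:derby:memory:TokenStore;create=true.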
[jira] [Created] (HDFS-17532) Allow router state store cache update to overwrite and delete in parallel
Felix N created HDFS-17532: -- Summary: Allow router state store cache update to overwrite and delete in parallel Key: HDFS-17532 URL: https://issues.apache.org/jira/browse/HDFS-17532 Project: Hadoop HDFS Issue Type: Improvement Components: hdfs, rbf Reporter: Felix N Assignee: Felix N Current implementation for router state store update is quite inefficient, so much that when routers are removed and a lot of NameNodeMembership records are deleted in a short burst, the deletions triggered a router safemode in our cluster and caused a lot of troubles. This ticket aims to allow the overwrite part and delete part of org.apache.hadoop.hdfs.server.federation.store.CachedRecordStore#overrideExpiredRecords to run in parallel. See HDFS-17529 for the other half of this improvement. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
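The overwrite/delete parallelism proposed here can be sketched with CompletableFuture. The two Runnables below are stand-ins for the real overwrite and delete passes in CachedRecordStore#overrideExpiredRecords; the class name is illustrative.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Sketch: run the overwrite pass and the delete pass of the cache
// refresh concurrently instead of sequentially.
class ParallelCacheUpdate {
  static void updateInParallel(Runnable overwrite, Runnable delete) {
    ExecutorService pool = Executors.newFixedThreadPool(2);
    try {
      CompletableFuture<Void> o = CompletableFuture.runAsync(overwrite, pool);
      CompletableFuture<Void> d = CompletableFuture.runAsync(delete, pool);
      CompletableFuture.allOf(o, d).join();  // wait for both passes
    } finally {
      pool.shutdown();
    }
  }
}
```

This shape assumes the two passes touch disjoint records; if a record could be both overwritten and deleted in one refresh, the real change would need to resolve that ordering first.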
[jira] [Created] (HDFS-17531) RBF: Aynchronous router RPC.
Jian Zhang created HDFS-17531: - Summary: RBF: Aynchronous router RPC. Key: HDFS-17531 URL: https://issues.apache.org/jira/browse/HDFS-17531 Project: Hadoop HDFS Issue Type: New Feature Reporter: Jian Zhang
[jira] [Created] (HDFS-17530) Aynchronous router
Jian Zhang created HDFS-17530: - Summary: Aynchronous router Key: HDFS-17530 URL: https://issues.apache.org/jira/browse/HDFS-17530 Project: Hadoop HDFS Issue Type: New Feature Reporter: Jian Zhang
[jira] [Created] (HDFS-17529) Improve router state store cache update
Felix N created HDFS-17529: -- Summary: Improve router state store cache update Key: HDFS-17529 URL: https://issues.apache.org/jira/browse/HDFS-17529 Project: Hadoop HDFS Issue Type: Improvement Components: hdfs, rbf Reporter: Felix N Assignee: Felix N

Current implementation for router state store update is quite inefficient, so much that when routers are removed and a lot of NameNodeMembership records are deleted in a short burst, the deletions triggered a router safemode in our cluster and caused a lot of trouble. This ticket contains 2 parts: improving the deletion process for the ZK state store implementation, and allowing the overwrite part and delete part of org.apache.hadoop.hdfs.server.federation.store.CachedRecordStore#overrideExpiredRecords to run in parallel.
[jira] [Resolved] (HDFS-17509) RBF: Fix ClientProtocol.concat will throw NPE if tgr is a empty file.
[ https://issues.apache.org/jira/browse/HDFS-17509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ZanderXu resolved HDFS-17509. - Fix Version/s: 3.5.0 Resolution: Fixed > RBF: Fix ClientProtocol.concat will throw NPE if tgr is a empty file. > -- > > Key: HDFS-17509 > URL: https://issues.apache.org/jira/browse/HDFS-17509 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: liuguanghua >Priority: Minor > Labels: pull-request-available > Fix For: 3.5.0 > > > hdfs dfs -concat /tmp/merge /tmp/t1 /tmp/t2 > When /tmp/merge is an empty file, this command throws an NPE via DFSRouter.
[jira] [Resolved] (HDFS-17520) TestDFSAdmin.testAllDatanodesReconfig and TestDFSAdmin.testDecommissionDataNodesReconfig failed
[jira] [Resolved] (HDFS-17520) TestDFSAdmin.testAllDatanodesReconfig and TestDFSAdmin.testDecommissionDataNodesReconfig failed

[ https://issues.apache.org/jira/browse/HDFS-17520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shilun Fan resolved HDFS-17520.
-------------------------------
    Fix Version/s: 3.4.1
                   3.5.0
     Hadoop Flags: Reviewed
Target Version/s: 3.4.1, 3.5.0
       Resolution: Fixed

> TestDFSAdmin.testAllDatanodesReconfig and
> TestDFSAdmin.testDecommissionDataNodesReconfig failed
> -----------------------------------------------------------------
>
>                 Key: HDFS-17520
>                 URL: https://issues.apache.org/jira/browse/HDFS-17520
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: ZanderXu
>            Assignee: ZanderXu
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 3.4.1, 3.5.0
>
> {code:java}
> [ERROR] Tests run: 21, Failures: 3, Errors: 0, Skipped: 0, Time elapsed: 44.521 s <<< FAILURE! - in org.apache.hadoop.hdfs.tools.TestDFSAdmin
> [ERROR] testAllDatanodesReconfig(org.apache.hadoop.hdfs.tools.TestDFSAdmin)  Time elapsed: 2.086 s <<< FAILURE!
> java.lang.AssertionError:
> Expecting:
>   <["Reconfiguring status for node [127.0.0.1:43731]: SUCCESS: Changed property dfs.datanode.peer.stats.enabled",
>     "  From: "false"",
>     "  To: "true"",
>     "started at Fri May 10 13:02:51 UTC 2024 and finished at Fri May 10 13:02:51 UTC 2024."]>
> to contain subsequence:
>   <["SUCCESS: Changed property dfs.datanode.peer.stats.enabled",
>     "  From: "false"",
>     "  To: "true""]>
> 	at org.apache.hadoop.hdfs.tools.TestDFSAdmin.testAllDatanodesReconfig(TestDFSAdmin.java:1286)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> 	at java.lang.reflect.Method.invoke(Method.java:498)
> 	at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59)
> 	at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
> 	at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56)
> 	at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
> 	at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
> 	at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
> 	at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
> 	at org.junit.runners.BlockJUnit4ClassRunner$1.evaluate(BlockJUnit4ClassRunner.java:100)
> 	at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:366)
> 	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:103)
> 	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:63)
> 	at org.junit.runners.ParentRunner$4.run(ParentRunner.java:331)
> 	at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:79)
> 	at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:329)
> 	at org.junit.runners.ParentRunner.access$100(ParentRunner.java:66)
> 	at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:293)
> 	at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
> 	at org.junit.runners.ParentRunner.run(ParentRunner.java:413)
> 	at org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365)
> 	at org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:273)
> 	at org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:238)
> 	at org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:159)
> 	at org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:384)
> 	at org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:345)
> 	at org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:126)
> 	at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:418)
> {code}

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
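The assertion failure above comes down to how `containsSubsequence` matches: each expected element must equal an element of the actual list exactly, in order; a substring match does not count. Once the reconfiguration output merges the "SUCCESS: ..." text into the node-status line, the first expected element can never match. A minimal, self-contained sketch (plain Java with a stand-in for AssertJ's `containsSubsequence` semantics; this is illustrative, not the actual test code):

```java
import java.util.Arrays;
import java.util.List;

public class SubsequenceDemo {
    // Stand-in for AssertJ's containsSubsequence: every expected element must
    // appear as an exact element of 'actual', in order; substrings don't count.
    static boolean containsSubsequence(List<String> actual, List<String> expected) {
        int i = 0;
        for (String line : actual) {
            if (i < expected.size() && line.equals(expected.get(i))) {
                i++;
            }
        }
        return i == expected.size();
    }

    public static void main(String[] args) {
        List<String> expected = Arrays.asList(
                "SUCCESS: Changed property dfs.datanode.peer.stats.enabled",
                "  From: \"false\"",
                "  To: \"true\"");
        // Actual output merges the SUCCESS text into the node-status line,
        // so the first expected element never matches exactly.
        List<String> merged = Arrays.asList(
                "Reconfiguring status for node [127.0.0.1:43731]: SUCCESS: Changed property dfs.datanode.peer.stats.enabled",
                "  From: \"false\"",
                "  To: \"true\"");
        System.out.println(containsSubsequence(merged, expected));  // false
        // With the SUCCESS text on its own line, the subsequence is found.
        List<String> split = Arrays.asList(
                "Reconfiguring status for node [127.0.0.1:43731]:",
                "SUCCESS: Changed property dfs.datanode.peer.stats.enabled",
                "  From: \"false\"",
                "  To: \"true\"");
        System.out.println(containsSubsequence(split, expected));   // true
    }
}
```

The fix therefore has to bring the asserted elements back in line with what the reconfiguration status printer actually emits (or vice versa), not to loosen the matching.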
[jira] [Created] (HDFS-17528) FsImageValidation: set txid when saving a new image
Tsz-wo Sze created HDFS-17528:
---------------------------------
             Summary: FsImageValidation: set txid when saving a new image
                 Key: HDFS-17528
                 URL: https://issues.apache.org/jira/browse/HDFS-17528
             Project: Hadoop HDFS
          Issue Type: Improvement
            Reporter: Tsz-wo Sze
[jira] [Created] (HDFS-17527) RBF: Routers should not allow observer reads when namenode stateId context is disabled
Simbarashe Dzinamarira created HDFS-17527:
---------------------------------------------
             Summary: RBF: Routers should not allow observer reads when namenode stateId context is disabled
                 Key: HDFS-17527
                 URL: https://issues.apache.org/jira/browse/HDFS-17527
             Project: Hadoop HDFS
          Issue Type: Improvement
            Reporter: Simbarashe Dzinamarira
[jira] [Resolved] (HDFS-17514) RBF: Routers keep using cached stateID even when active NN returns unset header
[ https://issues.apache.org/jira/browse/HDFS-17514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Simbarashe Dzinamarira resolved HDFS-17514.
-------------------------------------------
    Resolution: Fixed

> RBF: Routers keep using cached stateID even when active NN returns unset
> header
> -------------------------------------------------------------------------
>
>                 Key: HDFS-17514
>                 URL: https://issues.apache.org/jira/browse/HDFS-17514
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: rbf
>            Reporter: Simbarashe Dzinamarira
>            Assignee: Simbarashe Dzinamarira
>            Priority: Minor
>              Labels: pull-request-available
>
> When a namenode that had "dfs.namenode.state.context.enabled" set to true is
> restarted with the configuration set to false, routers will keep using a
> previously cached state ID.
> Without RBF:
> * Clients that fetched the old stateID could have stale reads even after
> msyncing.
> * New clients will go to the active.
> With RBF:
> * Clients that fetched the old stateID could have stale reads, as above.
> * New clients will also fetch the stale stateID and potentially have stale
> reads.
> New clients that are created after the restart should not fetch the stale
> state ID.
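The router-side behavior this issue asks for can be sketched as follows. The class and method names below are illustrative stand-ins, not the actual RBF code or patch; the point is only that when the active NN's response header carries no state ID, the router should drop its cached value rather than keep handing it to new clients:

```java
public class StateIdCacheDemo {
    // Hypothetical cache of a namespace's last-seen state ID on the router.
    static class NamespaceStateIdCache {
        private Long cachedStateId; // null means "no state ID known"

        void onResponseHeader(Long stateIdFromHeader) {
            if (stateIdFromHeader == null) {
                // The active NN no longer sends a state ID (context disabled):
                // forget the cached one so new clients cannot fetch a stale value.
                cachedStateId = null;
            } else if (cachedStateId == null || stateIdFromHeader > cachedStateId) {
                cachedStateId = stateIdFromHeader; // track the newest state ID
            }
        }

        boolean observerReadsAllowed() {
            return cachedStateId != null;
        }
    }

    public static void main(String[] args) {
        NamespaceStateIdCache cache = new NamespaceStateIdCache();
        cache.onResponseHeader(100L);  // NN has state context enabled
        System.out.println(cache.observerReadsAllowed());  // true
        cache.onResponseHeader(null);  // NN restarted with context disabled
        System.out.println(cache.observerReadsAllowed());  // false
    }
}
```

Clients that already hold the old state ID can still see stale reads (as the description notes); invalidating the cache only protects clients created after the restart.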
[jira] [Resolved] (HDFS-17099) Fix Null Pointer Exception when stop namesystem in HDFS
[ https://issues.apache.org/jira/browse/HDFS-17099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiaoqiao He resolved HDFS-17099.
--------------------------------
    Fix Version/s: 3.5.0
     Hadoop Flags: Reviewed
       Resolution: Fixed

> Fix Null Pointer Exception when stop namesystem in HDFS
> --------------------------------------------------------
>
>                 Key: HDFS-17099
>                 URL: https://issues.apache.org/jira/browse/HDFS-17099
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: ConfX
>            Assignee: ConfX
>            Priority: Critical
>              Labels: pull-request-available
>             Fix For: 3.5.0
>
>         Attachments: reproduce.sh
>
> h2. What happened:
> Got a NullPointerException when stopping the namesystem in HDFS.
> h2. Buggy code:
> {code:java}
> void stopActiveServices() {
>   ...
>   if (dir != null && getFSImage() != null) {
>     if (getFSImage().editLog != null) {  // <--- checks whether editLog is null
>       getFSImage().editLog.close();
>     }
>     // Update the fsimage with the last txid that we wrote
>     // so that the tailer starts from the right spot.
>     getFSImage().updateLastAppliedTxIdFromWritten();  // <--- BUG: executed even when editLog is null, causing the NPE
>   }
>   ...
> }
>
> public void updateLastAppliedTxIdFromWritten() {
>   this.lastAppliedTxId = editLog.getLastWrittenTxId();  // <--- throws NullPointerException if editLog is null
> }
> {code}
> h2. StackTrace:
> {code:java}
> java.lang.NullPointerException
> 	at org.apache.hadoop.hdfs.server.namenode.FSImage.updateLastAppliedTxIdFromWritten(FSImage.java:1553)
> 	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.stopActiveServices(FSNamesystem.java:1463)
> 	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.close(FSNamesystem.java:1815)
> 	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.<init>(FSNamesystem.java:1017)
> 	at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.initialize(SecondaryNameNode.java:248)
> 	at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.<init>(SecondaryNameNode.java:194)
> 	at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.<init>(SecondaryNameNode.java:181)
> {code}
> h2. How to reproduce:
> (1) Set {{dfs.namenode.top.windows.minutes}} to {{37914516,32,0}}; or set {{dfs.namenode.top.window.num.buckets}} to {{244111242}}.
> (2) Run the test:
> {{org.apache.hadoop.hdfs.server.namenode.TestNameNodeHttpServerXFrame#testSecondaryNameNodeXFrame}}
> h2. What's more:
> I'm still investigating how the parameter {{dfs.namenode.top.windows.minutes}} triggers the buggy code.
> For an easy reproduction, run reproduce.sh in the attachment.
> We are happy to provide a patch if this issue is confirmed.
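The guard the report suggests can be sketched with simplified stubs. The classes below are stand-ins for FSImage/FSEditLog, not the actual Hadoop patch; they only demonstrate that a null check inside `updateLastAppliedTxIdFromWritten` makes `stopActiveServices` safe when no edit log was ever opened:

```java
public class StopServicesDemo {
    // Stand-in for FSEditLog.
    static class EditLog {
        long getLastWrittenTxId() { return 42L; }
    }

    // Stand-in for FSImage.
    static class FSImageStub {
        EditLog editLog;            // null when the edit log was never opened
        long lastAppliedTxId = -1L;

        void updateLastAppliedTxIdFromWritten() {
            if (editLog != null) {  // guard: nothing was written, nothing to apply
                lastAppliedTxId = editLog.getLastWrittenTxId();
            }
        }
    }

    // Stand-in for the relevant part of FSNamesystem.stopActiveServices().
    static void stopActiveServices(FSImageStub image) {
        if (image.editLog != null) {
            // image.editLog.close();  (omitted in this stub)
        }
        image.updateLastAppliedTxIdFromWritten();  // safe even if editLog == null
    }

    public static void main(String[] args) {
        FSImageStub noLog = new FSImageStub();  // editLog == null, as in the report
        stopActiveServices(noLog);              // previously threw NullPointerException
        System.out.println(noLog.lastAppliedTxId);  // -1: left untouched, no NPE
    }
}
```

Moving the check inside `updateLastAppliedTxIdFromWritten` (rather than duplicating it at every call site) keeps the invariant in one place.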