[jira] [Resolved] (HDFS-17575) SaslDataTransferClient should use SaslParticipant to create messages

2024-08-05 Thread Tsz-wo Sze (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsz-wo Sze resolved HDFS-17575.
---
Hadoop Flags: Reviewed
  Resolution: Fixed

The pull request #6954 is now merged.

> SaslDataTransferClient should use SaslParticipant to create messages
> 
>
> Key: HDFS-17575
> URL: https://issues.apache.org/jira/browse/HDFS-17575
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: security
>Reporter: Tsz-wo Sze
>Assignee: Tsz-wo Sze
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.5.0
>
>
> Currently, a SaslDataTransferClient may send a message without using its 
> SaslParticipant as below.  {code}
>   sendSaslMessage(out, new byte[0]);
> {code}
> Instead, it should use its SaslParticipant to create the response.
> {code}
>   byte[] localResponse = sasl.evaluateChallengeOrResponse(remoteResponse);
>   sendSaslMessage(out, localResponse);
> {code}






[jira] [Created] (HDFS-17599) Fix the mismatch between locations and indices for mover

2024-08-03 Thread Tao Li (Jira)
Tao Li created HDFS-17599:
-

 Summary: Fix the mismatch between locations and indices for mover
 Key: HDFS-17599
 URL: https://issues.apache.org/jira/browse/HDFS-17599
 Project: Hadoop HDFS
  Issue Type: Bug
Affects Versions: 3.4.0, 3.3.0
Reporter: Tao Li
Assignee: Tao Li
 Attachments: image-2024-08-03-17-59-08-059.png, 
image-2024-08-03-18-00-01-950.png

We set the EC policy to (6+3) and also have nodes that were in state 
ENTERING_MAINTENANCE.
 
When we moved the data of some directories from SSD to HDD, some block moves 
failed because a disk was full, as shown in the figure below 
(blk_-9223372033441574269).
We tried to move again and got the error "Replica does not exist".
Looking at the fsck output, we found that the wrong block ID 
(blk_-9223372033441574270) was used when moving the block.
 
{*}Mover Logs{*}:
!image-2024-08-03-17-59-08-059.png|width=741,height=85!
 
{*}FSCK Info{*}:
!image-2024-08-03-18-00-01-950.png|width=738,height=120!
 
{*}Root Cause{*}:
Similar to HDFS-16333: when the mover is initialized, only `LIVE` nodes are 
processed. As a result, the datanode in the `ENTERING_MAINTENANCE` state is 
filtered out of the locations, but the indices are not adjusted accordingly, 
resulting in a mismatch between the lengths of locations and indices. The EC 
block then calculates the wrong block ID when getting the internal block (see 
`DBlockStriped#getInternalBlock`).
 
We added debug logs, and a few key messages are shown below. The result is an 
incorrect correspondence: xx.xx.7.31 -> -9223372033441574270.
{code:java}
DBlock getInternalBlock(StorageGroup storage) {
  // storage == xx.xx.7.31
  // idxInLocs == 1 (locations == [xx.xx.85.29:DISK, xx.xx.7.31:DISK,
  // xx.xx.207.22:DISK, xx.xx.8.25:DISK, xx.xx.79.30:DISK, xx.xx.87.21:DISK,
  // xx.xx.8.38:DISK]; xx.xx.179.31, which is in the ENTERING_MAINTENANCE
  // state, was filtered out)
  int idxInLocs = locations.indexOf(storage);
  if (idxInLocs == -1) {
return null;
  }
  // idxInGroup == 2 (indices is [1,2,3,4,5,6,7,8])   
  byte idxInGroup = indices[idxInLocs];
  // blkId: -9223372033441574272 + 2 = -9223372033441574270
  long blkId = getBlock().getBlockId() + idxInGroup;
  long numBytes = getInternalBlockLength(getNumBytes(), cellSize,
  dataBlockNum, idxInGroup);
  Block blk = new Block(getBlock());
  blk.setBlockId(blkId);
  blk.setNumBytes(numBytes);
  DBlock dblk = new DBlock(blk);
  dblk.addLocation(storage);
  return dblk;
} {code}
{*}Solution{*}:
When initializing DBlockStriped, if any location is filtered out, we need to 
remove the corresponding element from the indices so that the two stay 
aligned, as sketched below.
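
A minimal sketch of the idea with simplified types (illustrating how to keep 
the two lists aligned; not the actual patch):
{code:java}
import java.util.List;
import java.util.function.Predicate;

class IndexSync {
  // Drop each filtered location together with its index, so that
  // locations.size() always equals indices.size().
  static <T> void filterInSync(List<T> locations, List<Byte> indices,
      Predicate<T> isUsable) {
    for (int i = locations.size() - 1; i >= 0; i--) {
      if (!isUsable.test(locations.get(i))) {
        locations.remove(i);   // remove the filtered location ...
        indices.remove(i);     // ... and its index at the same position
      }
    }
  }
}
{code}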
 






[jira] [Resolved] (HDFS-17544) [ARR] The router client rpc protocol PB supports asynchrony.

2024-08-02 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He resolved HDFS-17544.

Fix Version/s: HDFS-17531
 Hadoop Flags: Reviewed
   Resolution: Fixed

> [ARR] The router client rpc protocol PB supports asynchrony.
> 
>
> Key: HDFS-17544
> URL: https://issues.apache.org/jira/browse/HDFS-17544
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: rbf
>Reporter: Jian Zhang
>Assignee: Jian Zhang
>Priority: Major
>  Labels: pull-request-available
> Fix For: HDFS-17531
>
>
> *Describe*
> To avoid affecting other modules, the router's asynchronous client RPC 
> protocolPB implementation mainly extends the original protocolPB classes. 
> The implemented protocolPB classes are as follows:
> *RouterClientProtocolTranslatorPB* extends ClientNamenodeProtocolTranslatorPB
> *RouterGetUserMappingsProtocolTranslatorPB* extends 
> GetUserMappingsProtocolClientSideTranslatorPB
> *RouterNamenodeProtocolTranslatorPB* extends NamenodeProtocolTranslatorPB
> *RouterRefreshUserMappingsProtocolTranslatorPB* extends 
> RefreshUserMappingsProtocolClientSideTranslatorPB
> Then the router's *ConnectionPool* uses the aforementioned protocolPB classes.
> The implementation of the asynchronous RPC client mainly builds on 
> HADOOP-13226, HDFS-10224, etc.
> {*}AsyncRpcProtocolPBUtil{*}: makes the implementation of the asynchronous 
> RPC protocol more concise and clear.
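>
> A generic illustration of the extension pattern described above, using plain 
> Java stand-ins rather than the actual HDFS classes (illustrative only):
> {code:java}
> import java.util.concurrent.CompletableFuture;
> import java.util.concurrent.ExecutorService;
> import java.util.concurrent.Executors;
>
> // Stands in for an existing synchronous protocolPB translator.
> class SyncTranslator {
>   String invoke(String request) {
>     return "reply:" + request;
>   }
> }
>
> // The async subclass keeps the original API intact and adds a
> // non-blocking variant, mirroring how the Router* translators extend
> // the existing ones.
> class AsyncTranslator extends SyncTranslator {
>   private final ExecutorService pool = Executors.newCachedThreadPool();
>
>   CompletableFuture<String> invokeAsync(String request) {
>     return CompletableFuture.supplyAsync(() -> super.invoke(request), pool);
>   }
> }
> {code}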
>  
> *Test*
> new UTs:
> 1.TestAsyncRpcProtocolPBUtil
> 2.TestRouterClientSideTranslatorPB






[jira] [Created] (HDFS-17598) Optimizations for DatanodeManager for large-scale cases

2024-07-31 Thread Hao-Nan Zhu (Jira)
Hao-Nan Zhu created HDFS-17598:
--

 Summary: Optimizations for DatanodeManager for large-scale cases
 Key: HDFS-17598
 URL: https://issues.apache.org/jira/browse/HDFS-17598
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: performance
Affects Versions: 3.4.0
Reporter: Hao-Nan Zhu


Hello,

 

I wonder if there are opportunities to optimize the {_}DatanodeManager{_} a 
bit, to improve its performance when the number of _datanodes_ is large:
 * 
[_fetchDatanodes_|https://github.com/naver/hadoop/blob/0c0a80f96283b5a7be234663e815bc04bafc8be2/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeManager.java#L1144]
 calls 
[_removeDecomNodeFromList_|https://github.com/naver/hadoop/blob/0c0a80f96283b5a7be234663e815bc04bafc8be2/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeManager.java#L817]
 for both the live and the dead datanode lists. _removeDecomNodeFromList_ has 
to iterate over all datanodes in the list. This can be optimized by checking 
whether the node is decommissioned using _node.isDecommissioned()_ before 
adding the node to the live and dead lists in the first place (see the sketch 
after this list). 
 * 
[_getNumLiveDataNodes_|https://github.com/naver/hadoop/blob/0c0a80f96283b5a7be234663e815bc04bafc8be2/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeManager.java#L1055]
 iterates over all datanodes. However, 
[_getNumDeadDataNodes_|https://github.com/naver/hadoop/blob/0c0a80f96283b5a7be234663e815bc04bafc8be2/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeManager.java#L1068]
 gets the size in a different (presumably more efficient) way. Is there a 
reason that _getNumLiveDataNodes_ has to iterate over the entire 
{_}datanodeMap{_}? Can we use the same approach for _getNumLiveDataNodes_? 
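
A rough sketch of the first suggestion, using simplified stand-ins for 
_DatanodeDescriptor_ and the live/dead lists (illustrative only, not a patch):
{code:java}
import java.util.List;

class Demo {
  static class Node {
    final boolean decommissioned;
    final boolean alive;
    Node(boolean decommissioned, boolean alive) {
      this.decommissioned = decommissioned;
      this.alive = alive;
    }
  }

  // Skip decommissioned nodes while building the live/dead lists, so the
  // later full scan in removeDecomNodeFromList becomes unnecessary.
  static void partition(List<Node> all, List<Node> live, List<Node> dead) {
    for (Node n : all) {
      if (n.decommissioned) {
        continue;               // filtered up front, O(1) per node
      }
      if (n.alive) {
        live.add(n);
      } else {
        dead.add(n);
      }
    }
  }
}
{code}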


And similar observations for 
[_resetLastCachingDirectiveSentTime_|https://github.com/naver/hadoop/blob/0c0a80f96283b5a7be234663e815bc04bafc8be2/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeManager.java#L1560]
 and 
[_getDatanodeListForReport_|https://github.com/naver/hadoop/blob/0c0a80f96283b5a7be234663e815bc04bafc8be2/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeManager.java#L1253].
 It seems optimizing these methods could make these checks more performant, 
especially when the number of datanodes is large. Are there any plans for 
these kinds of large-scale (micro) optimizations?

 

Please let me know if I need to provide more information. Thanks!






[jira] [Resolved] (HDFS-14883) NPE when the second SNN is starting

2024-07-30 Thread Tao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tao Li resolved HDFS-14883.
---
Resolution: Duplicate

> NPE when the second SNN is starting
> ---
>
> Key: HDFS-14883
> URL: https://issues.apache.org/jira/browse/HDFS-14883
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 3.1.1
>Reporter: Ranith Sardar
>Assignee: Ranith Sardar
>Priority: Major
>  Labels: multi-sbnn
> Fix For: 3.4.0, 3.3.1, 2.10.1, 3.2.2
>
>
>  
> {noformat}
> | WARN | qtp79782883-47 | /imagetransfer | ServletHandler.java:632
> java.io.IOException: PutImage failed. java.lang.NullPointerException
> at org.apache.hadoop.hdfs.server.namenode.ImageServlet.validateRequest(ImageServlet.java:198)
> at org.apache.hadoop.hdfs.server.namenode.ImageServlet.doPut(ImageServlet.java:485)
> at javax.servlet.http.HttpServlet.service(HttpServlet.java:710)
> at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
> at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:848)
> at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1772)
> at org.apache.hadoop.security.authentication.server.AuthenticationFilter.doFilter(AuthenticationFilter.java:644)
> {noformat}






[jira] [Created] (HDFS-17597) [ARR] RouterSnapshot supports asynchronous rpc.

2024-07-26 Thread Jian Zhang (Jira)
Jian Zhang created HDFS-17597:
-

 Summary: [ARR] RouterSnapshot supports asynchronous rpc.
 Key: HDFS-17597
 URL: https://issues.apache.org/jira/browse/HDFS-17597
 Project: Hadoop HDFS
  Issue Type: Sub-task
Reporter: Jian Zhang


*Describe*

The main new addition is RouterAsyncSnapshot, which extends RouterSnapshot so 
that it supports asynchronous RPC.






[jira] [Created] (HDFS-17596) [ARR] RouterStoragePolicy supports asynchronous rpc.

2024-07-26 Thread Jian Zhang (Jira)
Jian Zhang created HDFS-17596:
-

 Summary: [ARR] RouterStoragePolicy supports asynchronous rpc.
 Key: HDFS-17596
 URL: https://issues.apache.org/jira/browse/HDFS-17596
 Project: Hadoop HDFS
  Issue Type: Sub-task
Reporter: Jian Zhang


*Describe*

The main new addition is RouterAsyncStoragePolicy, which extends 
RouterStoragePolicy so that it supports asynchronous RPC.






[jira] [Created] (HDFS-17595) [ARR] ErasureCoding supports asynchronous rpc.

2024-07-26 Thread Jian Zhang (Jira)
Jian Zhang created HDFS-17595:
-

 Summary: [ARR] ErasureCoding supports asynchronous rpc.
 Key: HDFS-17595
 URL: https://issues.apache.org/jira/browse/HDFS-17595
 Project: Hadoop HDFS
  Issue Type: Sub-task
Reporter: Jian Zhang









[jira] [Created] (HDFS-17594) [ARR] RouterCacheAdmin supports asynchronous rpc.

2024-07-26 Thread Jian Zhang (Jira)
Jian Zhang created HDFS-17594:
-

 Summary: [ARR] RouterCacheAdmin supports asynchronous rpc.
 Key: HDFS-17594
 URL: https://issues.apache.org/jira/browse/HDFS-17594
 Project: Hadoop HDFS
  Issue Type: Sub-task
Reporter: Jian Zhang









[jira] [Created] (HDFS-17593) Allow setting block locations when opening streams

2024-07-26 Thread Csaba Ringhofer (Jira)
Csaba Ringhofer created HDFS-17593:
--

 Summary: Allow setting block locations when opening streams
 Key: HDFS-17593
 URL: https://issues.apache.org/jira/browse/HDFS-17593
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: Csaba Ringhofer


The HDFS client seems to always get block locations from the namenode when 
opening a file:

https://github.com/apache/hadoop/blob/4525c7e35ea22d7a6350b8af10eb8d2ff68376e7/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DFSClient.java#L1099

This leads to unnecessary RPCs in Apache Impala when doing remote reads, as the 
block locations are cached globally and the executors already have a good guess 
about the block locations when opening a stream. Unless the cached block 
locations are stale, ideally no RPC should be made to the namenode.
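
As an illustration of the client-side shape this could take: the openFile() 
builder already lets callers pass a cached FileStatus; the commented-out 
withLocatedBlocks option below is hypothetical and only sketches this proposal.
{code:java}
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

class OpenWithHint {
  static FSDataInputStream open(FileSystem fs, Path path, FileStatus cached)
      throws Exception {
    return fs.openFile(path)
        .withFileStatus(cached)   // existing option (Hadoop 3.3+)
        // .withLocatedBlocks(cachedBlocks)  // hypothetical option for this JIRA
        .build().get();
  }
}
{code}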






[jira] [Created] (HDFS-17592) FastCopy support data copy in different nameservices without federation

2024-07-26 Thread liuguanghua (Jira)
liuguanghua created HDFS-17592:
--

 Summary: FastCopy support data copy in different nameservices 
without federation
 Key: HDFS-17592
 URL: https://issues.apache.org/jira/browse/HDFS-17592
 Project: Hadoop HDFS
  Issue Type: Sub-task
Reporter: liuguanghua


FastCopy is a faster data copy tool. In a federation cluster or a single 
cluster, FastCopy copies blocks via hardlinks, which is much faster than the 
original copy.

FastCopy can support data copy via transfer between different nameservices 
without federation. In theory, it could save almost half the time compared to 
the original copy.






[jira] [Created] (HDFS-17591) RBF: Router should follow X-FRAME-OPTIONS protection setting

2024-07-25 Thread Takanobu Asanuma (Jira)
Takanobu Asanuma created HDFS-17591:
---

 Summary: RBF: Router should follow X-FRAME-OPTIONS protection 
setting
 Key: HDFS-17591
 URL: https://issues.apache.org/jira/browse/HDFS-17591
 Project: Hadoop HDFS
  Issue Type: Task
Reporter: Takanobu Asanuma
Assignee: Takanobu Asanuma


The Router UI doesn't set X-FRAME-OPTIONS in its response headers. The Router 
should load the value of dfs.xframe.value.

This issue is reported by Daiki Mashima.






[jira] [Created] (HDFS-17590) `NullPointerException` triggered in `createBlockReader` during retry iteration

2024-07-23 Thread Elmer J Fudd (Jira)
Elmer J Fudd created HDFS-17590:
---

 Summary: `NullPointerException` triggered in `createBlockReader` 
during retry iteration
 Key: HDFS-17590
 URL: https://issues.apache.org/jira/browse/HDFS-17590
 Project: Hadoop HDFS
  Issue Type: Bug
Affects Versions: 3.4.0
Reporter: Elmer J Fudd


While reading blocks of data using `DFSInputStream` in `createBlockReader`, an 
`IOException` originating from `getBlockAt()` that triggers a retry iteration 
results in a `NullPointerException` when `dnInfo` is passed to 
`addToLocalDeadNodes` in the catch block.

 

This is the relevant callstack portion from our logs (from 3.4.0, but this was 
occurring with "trunk" versions as new as late June which 3.4.1 builds upon):
{noformat}
...
java.lang.NullPointerException at java.base/
java.util.concurrent.ConcurrentHashMap.putVal(ConcurrentHashMap.java:1011) at 
java.base/java.util.concurrent.ConcurrentHashMap.put(ConcurrentHashMap.java:1006)
 at 
org.apache.hadoop.hdfs.DFSInputStream.addToLocalDeadNodes(DFSInputStream.java:184)
 at 
org.apache.hadoop.hdfs.DFSStripedInputStream.createBlockReader(DFSStripedInputStream.java:279)
 at org.apache.hadoop.hdfs.StripeReader.readChunk(StripeReader.java:304) at 
org.apache.hadoop.hdfs.StripeReader.readStripe(StripeReader.java:335) at 
org.apache.hadoop.hdfs.DFSStripedInputStream.fetchBlockByteRange(DFSStripedInputStream.java:504)
 at org.apache.hadoop.hdfs.DFSInputStream.pread(DFSInputStream.java:1472) at 
org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:1436) at 
org.apache.hadoop.fs.FSInputStream.readFully(FSInputStream.java:124) at 
org.apache.hadoop.fs.FSDataInputStream.readFully(FSDataInputStream.java:119) 
...{noformat}
 

What we observe is that `getBlockAt()` throws an `IOException` here:

{code:java}
//check offset 
if (offset < 0 || offset >= getFileLength()) { 
  throw new IOException("offset < 0 || offset >= getFileLength(), offset=" 
  + offset 
  + ", locatedBlocks=" + locatedBlocks); 
} 
{code}
 

This is eventually caught in `createBlockReader`. The catch block attempts to 
handle the error and, as part of the error handling, invokes the 
`addToLocalDeadNodes` method. Note that the `dnInfo` object passed to this 
method is `NULL`, as it wasn't fully allocated 
[here|https://github.com/apache/hadoop/blob/4525c7e35ea22d7a6350b8af10eb8d2ff68376e7/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DFSStripedInputStream.java#L247],
 which results in a `NullPointerException`. 

 

To sum up, this is the failure path according to the logs:
 # `IOException` is thrown in `getBlockAt` 
([code|https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DFSInputStream.java#L479])
 # The exception propagates to `getBlockGroupAt` 
([code|https://github.com/apache/hadoop/blob/4525c7e35ea22d7a6350b8af10eb8d2ff68376e7/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DFSStripedInputStream.java#L476])
 # It further propagates to `refreshLocatedBlock` 
([code|https://github.com/apache/hadoop/blob/4525c7e35ea22d7a6350b8af10eb8d2ff68376e7/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DFSStripedInputStream.java#L459])
 # `IOException` caught in `createBlockReader` 
([code|https://github.com/apache/hadoop/blob/4525c7e35ea22d7a6350b8af10eb8d2ff68376e7/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DFSStripedInputStream.java#L247])
 # Error handling in the catch block of `createBlockReader` invokes 
`addToLocalDeadNodes` 
([code|https://github.com/apache/hadoop/blob/4525c7e35ea22d7a6350b8af10eb8d2ff68376e7/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DFSStripedInputStream.java#L281])
 # Execution throws `NullPointerException` since `dnInfo` is NULL

 

A simple fix, i.e. a `NULL` check that adds only non-NULL `dnInfo` objects to 
the hash map (and a matching adjustment of the log messages in the `catch` 
block), should solve the issue.
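
A minimal sketch of such a guard in the catch block (illustrative; `dnInfo.info` 
follows the `DNAddrPair` field layout, and this is not the committed fix):
{code:java}
if (dnInfo != null) {
  // Only mark a datanode dead if one was actually resolved before
  // the IOException was thrown.
  addToLocalDeadNodes(dnInfo.info);
} {code}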






[jira] [Resolved] (HDFS-16690) Automatically format new unformatted JournalNodes using JournalNodeSyncer

2024-07-23 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He resolved HDFS-16690.

Fix Version/s: 3.5.0
 Hadoop Flags: Reviewed
   Resolution: Fixed

> Automatically format new unformatted JournalNodes using JournalNodeSyncer 
> --
>
> Key: HDFS-16690
> URL: https://issues.apache.org/jira/browse/HDFS-16690
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: journal-node
>Affects Versions: 3.4.0, 3.3.5
> Environment: Demonstrated in a Kubernetes environment running Java 11.
>  # Start new cluster, but short 1 JN (minimum quorum, and the missing JN 
> won’t resolve). VERIFY:
>  - NN formats the 2 existing JN and stabilizes.  NOTE: Formatting using just 
> a quorum will be a separate submission
>  - Messages show sync between JN-0 and JN-1, and NN -> JN.
>  # Scale JN stateful set to add missing JN. VERIFY:
>  - New JN starts
>  - All other JN and all NN report IP address change (IP Address resolution).  
> NOTE: require HADOOP-18365 and HDFS-16688
>  - Messages show sync between all JN, and NN -> JN
>  - New JN is formatted at least once (possibly by multiple other JN)
>  - New JN storage directory is formatted only once
>  - New JN joins cluster (lastWriterEpoch is non-zero)
>Reporter: Steve Vaughan
>Assignee: Aswin M Prabhu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.5.0
>
>
> If an unformatted JournalNode is added to an existing JournalNode set, 
> instances of the JournalNodeSyncer are unable to sync to the new node.  When 
> a sync receives a JournalNotFormattedException, we can initiate a format 
> operation, and then retry the synchronization.
> Conceptually this means that the JournalNodes and their data can be managed 
> independently from the rest of the system, as the JournalNodes will 
> incorporate new JournalNode instances.  Once the new JournalNode is 
> formatted, it can participate in shared edits from the NameNodes. 
> I've been testing an update to the InterQJournalProtocol to add a format call 
> like that used by the NameNode.  Current tests include starting an HA cluster 
> from scratch, but with 2 JournalNode instances.  Once the cluster is up, I 
> can add the 3rd JournalNode (which is unformatted), and the other 2 
> JournalNodes will eventually attempt to sync which results in a formatting 
> and subsequent sync.
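>
> A rough sketch of the retry flow (method names assumed for illustration; 
> JournalNotFormattedException is the real exception type):
> {code:java}
> try {
>   syncWithJournalAtIndex(i);            // existing sync attempt
> } catch (JournalNotFormattedException e) {
>   formatUnformattedJournal(i);          // hypothetical format call
>   syncWithJournalAtIndex(i);            // retry after formatting
> } {code}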






[jira] [Created] (HDFS-17589) hdfs EC data old blk reconstruct old blk not delete

2024-07-22 Thread ruiliang (Jira)
ruiliang created HDFS-17589:
---

 Summary: hdfs EC data  old blk reconstruct   old blk not delete
 Key: HDFS-17589
 URL: https://issues.apache.org/jira/browse/HDFS-17589
 Project: Hadoop HDFS
  Issue Type: Bug
Affects Versions: 3.1.1
Reporter: ruiliang


The reason is that the cluster was faulty before, and DataNodes kept losing 
connections and recovering, resulting in a lot of EC data reconstruction, but 
many old blocks failed to be cleaned up correctly. Has this been fixed? What 
patch do I need to apply? Thank you.

The following is a detailed check log:

 
{code:java}
datanode delete data ec blk ?
 grep blk_-9223372036371044656  
hadoop-hdfs-root-datanode-fs-hiido-dn-12-66-111.hiido.host.xxx.com.log
2024-07-18 17:25:07,879 INFO  datanode.DataNode 
(DataXceiver.java:writeBlock(738)) - Receiving 
BP-1822992414-10.12.65.48-1660893388633:blk_-9223372036371044656_1688858793 
src: /10.12.66.111:25066 dest: /10.12.66.111:1019
2024-07-18 17:25:17,396 INFO  datanode.DataNode 
(StripedBlockReconstructor.java:run(86)) - ok EC reconstruct striped block: 
BP-1822992414-10.12.65.48-1660893388633:blk_-9223372036371044656_1688858793  
blockId: -9223372036371044656
2024-07-18 17:25:17,396 INFO  datanode.DataNode 
(DataXceiver.java:writeBlock(914)) - Received 
BP-1822992414-10.12.65.48-1660893388633:blk_-9223372036371044656_1688858793 
src: /10.12.66.111:25066 dest: /10.12.66.111:1019 of size 193986560
2024-07-18 17:25:25,465 INFO  impl.FsDatasetAsyncDiskService 
(FsDatasetAsyncDiskService.java:deleteAsync(225)) - Scheduling 
blk_-9223372036371044656_1688858793 replica FinalizedReplica, 
blk_-9223372036371044656_1688858793, FINALIZED
  getBlockURI()     = 
file:/data4/hadoop/dfs/data/current/BP-1822992414-10.12.65.48-1660893388633/current/finalized/subdir21/subdir6/blk_-9223372036371044656
 for deletion
2024-07-18 17:25:25,746 INFO  impl.FsDatasetAsyncDiskService 
(FsDatasetAsyncDiskService.java:run(333)) - Deleted 
BP-1822992414-10.12.65.48-1660893388633 blk_-9223372036371044656_1688858793 URI 
file:/data4/hadoop/dfs/data/current/BP-1822992414-10.12.65.48-1660893388633/current/finalized/subdir21/subdir6/blk_-9223372036371044656

my config
dfs.blockreport.intervalMsec = 2160

namenode3 log
hadoop-hdfs-namenode-fs-hiido-yycluster06-yynn3.hiido.host.xxxyy.com.log.1:2024-07-18
 04:34:39,523 WARN  BlockStateChange (BlockManager.java:addStoredBlock(3238)) - 
BLOCK* addStoredBlock: block blk_-9223372036371044656_1688858793 moved to 
storageType DISK on node 10.12.66.154:1019
hadoop-hdfs-namenode-fs-hiido-yycluster06-yynn3.hiido.host.xxxyy.com.log.1:2024-07-18
 04:34:40,131 WARN  BlockStateChange (BlockManager.java:addStoredBlock(3238)) - 
BLOCK* addStoredBlock: block blk_-9223372036371044656_1688858793 moved to 
storageType DISK on node 10.12.66.154:1019
hadoop-hdfs-namenode-fs-hiido-yycluster06-yynn3.hiido.host.xxxyy.com.log.1:2024-07-18
 10:34:38,950 WARN  BlockStateChange (BlockManager.java:addStoredBlock(3238)) - 
BLOCK* addStoredBlock: block blk_-9223372036371044656_1688858793 moved to 
storageType DISK on node 10.12.66.154:1019
hadoop-hdfs-namenode-fs-hiido-yycluster06-yynn3.hiido.host.xxxyy.com.log.1:2024-07-18
 10:34:39,559 WARN  BlockStateChange (BlockManager.java:addStoredBlock(3238)) - 
BLOCK* addStoredBlock: block blk_-9223372036371044656_1688858793 moved to 
storageType DISK on node 10.12.66.154:1019
hadoop-hdfs-namenode-fs-hiido-yycluster06-yynn3.hiido.host.xxxyy.com.log:2024-07-18
 16:34:38,564 WARN  BlockStateChange (BlockManager.java:addStoredBlock(3238)) - 
BLOCK* addStoredBlock: block blk_-9223372036371044656_1688858793 moved to 
storageType DISK on node 10.12.66.154:1019
hadoop-hdfs-namenode-fs-hiido-yycluster06-yynn3.hiido.host.xxxyy.com.log:2024-07-18
 16:34:39,190 WARN  BlockStateChange (BlockManager.java:addStoredBlock(3238)) - 
BLOCK* addStoredBlock: block blk_-9223372036371044656_1688858793 moved to 
storageType DISK on node 10.12.66.154:1019
hadoop-hdfs-namenode-fs-hiido-yycluster06-yynn3.hiido.host.xxxyy.com.log.1:2024-07-17
 04:34:39,462 WARN  BlockStateChange (BlockManager.java:addStoredBlock(3238)) - 
BLOCK* addStoredBlock: block blk_-9223372036371044656_1688858793 moved to 
storageType DISK on node 10.12.66.154:1019
hadoop-hdfs-namenode-fs-hiido-yycluster06-yynn3.hiido.host.xxxyy.com.log.1:2024-07-17
 04:34:40,083 WARN  BlockStateChange (BlockManager.java:addStoredBlock(3238)) - 
BLOCK* addStoredBlock: block blk_-9223372036371044656_1688858793 moved to 
storageType DISK on node 10.12.66.154:1019
hadoop-hdfs-namenode-fs-hiido-yycluster06-yynn3.hiido.host.xxxyy.com.log.1:2024-07-17
 10:34:39,686 WARN  BlockStateChange (BlockManager.java:addStoredBlock(3238)) - 
BLOCK* addStoredBlock: block blk_-9223372036371044656_1688858793 moved to 
storageType DISK on node 10.12.66.154:1019
hadoop-hdfs-namenode-fs-hiido-yycluster06-yynn3.hiido.host.xxxyy.com.log.1:2024-07-17

[jira] [Created] (HDFS-17588) RBF: Clients using RouterObserverReadProxyProvider should first perform msync.

2024-07-22 Thread fuchaohong (Jira)
fuchaohong created HDFS-17588:
-

 Summary: RBF: Clients using RouterObserverReadProxyProvider should 
first perform msync.
 Key: HDFS-17588
 URL: https://issues.apache.org/jira/browse/HDFS-17588
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: rbf
Reporter: fuchaohong


When using RouterObserverReadProxyProvider to initiate the first RPC request, 
the router routes this RPC to the active namenode and updates the stateid of 
the corresponding nameservice. However, the stateids of the other nameservices 
are not updated. Clients should first perform msync to update the stateids of 
all nameservices that have Observers enabled.
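
For illustration, the ordering being proposed on the client side (msync is an 
existing FileSystem API; the variables here are placeholders):
{code:java}
fs.msync();                            // refresh stateids of all nameservices
FSDataInputStream in = fs.open(path);  // later reads may be served by Observers
{code}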






[jira] [Created] (HDFS-17587) "StorageTypeStats" Metric should not include decommissioned/in-maintenance nodes

2024-07-22 Thread Mohamed Aashif (Jira)
Mohamed Aashif created HDFS-17587:
-

 Summary: "StorageTypeStats" Metric should not include 
decommissioned/in-maintenance nodes
 Key: HDFS-17587
 URL: https://issues.apache.org/jira/browse/HDFS-17587
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: namenode
Reporter: Mohamed Aashif









[jira] [Reopened] (HDFS-17575) SaslDataTransferClient should use SaslParticipant to create messages

2024-07-21 Thread Tsz-wo Sze (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsz-wo Sze reopened HDFS-17575:
---

The [pull request 6933|https://github.com/apache/hadoop/pull/6933] has caused 
a test failure.  Reverted it.

> SaslDataTransferClient should use SaslParticipant to create messages
> 
>
> Key: HDFS-17575
> URL: https://issues.apache.org/jira/browse/HDFS-17575
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: security
>Reporter: Tsz-wo Sze
>Assignee: Tsz-wo Sze
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.5.0
>
>
> Currently, a SaslDataTransferClient may send a message without using its 
> SaslParticipant as below.  {code}
>   sendSaslMessage(out, new byte[0]);
> {code}
> Instead, it should use its SaslParticipant to create the response.
> {code}
>   byte[] localResponse = sasl.evaluateChallengeOrResponse(remoteResponse);
>   sendSaslMessage(out, localResponse);
> {code}






[jira] [Resolved] (HDFS-17576) Support user defined auth Callback

2024-07-20 Thread Tsz-wo Sze (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsz-wo Sze resolved HDFS-17576.
---
Fix Version/s: 3.3.7
 Hadoop Flags: Reviewed
   Resolution: Fixed

The pull request is now merged.

> Support user defined auth Callback
> --
>
> Key: HDFS-17576
> URL: https://issues.apache.org/jira/browse/HDFS-17576
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: security
>Reporter: Tsz-wo Sze
>Assignee: Tsz-wo Sze
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.3.7
>
>
> Some security providers may define a new 
> javax.security.auth.callback.Callback.  This JIRA is to allow users to 
> configure a customized callback handler in such cases.






[jira] [Resolved] (HDFS-17575) SaslDataTransferClient should use SaslParticipant to create messages

2024-07-20 Thread Tsz-wo Sze (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsz-wo Sze resolved HDFS-17575.
---
Fix Version/s: 3.3.7
   Resolution: Fixed

The pull request is now merged.

> SaslDataTransferClient should use SaslParticipant to create messages
> 
>
> Key: HDFS-17575
> URL: https://issues.apache.org/jira/browse/HDFS-17575
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: security
>Reporter: Tsz-wo Sze
>Assignee: Tsz-wo Sze
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.3.7
>
>
> Currently, a SaslDataTransferClient may send a message without using its 
> SaslParticipant as below.  {code}
>   sendSaslMessage(out, new byte[0]);
> {code}
> Instead, it should use its SaslParticipant to create the response.
> {code}
>   byte[] localResponse = sasl.evaluateChallengeOrResponse(remoteResponse);
>   sendSaslMessage(out, localResponse);
> {code}






[jira] [Created] (HDFS-17586) Fix timestamp in rbfbalance tool.

2024-07-19 Thread Zhaobo Huang (Jira)
Zhaobo Huang created HDFS-17586:
---

 Summary: Fix timestamp in rbfbalance tool.
 Key: HDFS-17586
 URL: https://issues.apache.org/jira/browse/HDFS-17586
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: Zhaobo Huang


When the 'Federation Balance' tool calls the 'DistCp' tool, timestamps are not 
retained.
 






[jira] [Created] (HDFS-17584) DistributedFileSystem verifyChecksum should be configurable

2024-07-16 Thread Liangjun He (Jira)
Liangjun He created HDFS-17584:
--

 Summary: DistributedFileSystem verifyChecksum should be 
configurable
 Key: HDFS-17584
 URL: https://issues.apache.org/jira/browse/HDFS-17584
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: hdfs-client
Reporter: Liangjun He
Assignee: Liangjun He


In some of our POC scenarios, we would like to set the verifyChecksum of 
DistributedFileSystem to false, but currently, verifyChecksum is not 
configurable and the default value is true.
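
For context, a minimal sketch of what is possible today (setVerifyChecksum is 
an existing FileSystem API; the JIRA asks for a config-key equivalent):
{code:java}
FileSystem fs = FileSystem.get(conf);
fs.setVerifyChecksum(false);   // currently only settable programmatically
{code}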






[jira] [Created] (HDFS-17583) Support auto-refresh of latest viewDFS mount configuration

2024-07-16 Thread Palakur Eshwitha Sai (Jira)
Palakur Eshwitha Sai created HDFS-17583:
---

 Summary: Support auto-refresh of latest viewDFS mount configuration
 Key: HDFS-17583
 URL: https://issues.apache.org/jira/browse/HDFS-17583
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: Palakur Eshwitha Sai
Assignee: Palakur Eshwitha Sai


Currently, the central mount table configuration feature for viewDFS loads the 
latest mounts each time the viewDFS initialize function is called. But in the 
Hive use case, Hive calls initialize on the filesystem only once, the first 
time after an HS2 restart. From then on, it fetches the filesystem and 
resolved mount points from its cache.

This requires HS2 restart each time some data is moved to S3 or other hadoop 
compatible file systems and a new mount-table.xml file is added to the viewDFS 
central mount config directory, which is not ideal.

We should implement a way in which the mount table is auto-loaded at specific 
intervals, or each time the central mount-table directory is updated with a 
new mount-table.xml file. 






[jira] [Created] (HDFS-17582) Distcp support fastcopy

2024-07-15 Thread liuguanghua (Jira)
liuguanghua created HDFS-17582:
--

 Summary: Distcp support fastcopy 
 Key: HDFS-17582
 URL: https://issues.apache.org/jira/browse/HDFS-17582
 Project: Hadoop HDFS
  Issue Type: Sub-task
  Components: hdfs
Reporter: liuguanghua
Assignee: liuguanghua


DistCp supports fastcopy for distributed data replication across the same 
nameservice, or across different nameservices in an HDFS federation cluster.

This depends on:
 # HDFS-16757
 # HDFS-17581






[jira] [Created] (HDFS-17581) Add FastCopy tool and support dfs -fastcp command

2024-07-15 Thread liuguanghua (Jira)
liuguanghua created HDFS-17581:
--

 Summary: Add FastCopy tool and support dfs -fastcp command
 Key: HDFS-17581
 URL: https://issues.apache.org/jira/browse/HDFS-17581
 Project: Hadoop HDFS
  Issue Type: Sub-task
  Components: hdfs
Reporter: liuguanghua
Assignee: liuguanghua


Add a FastCopy tool:

(1) support data replication for replicated files

(2) support data replication for EC files

Also add an hdfs dfs -fastcp command that copies files using fastcopy; fastcp 
is similar to the cp command, as in the hypothetical usage below.
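
A hypothetical invocation of the proposed command (paths invented for 
illustration):
{code}
hdfs dfs -fastcp hdfs://ns1/user/a/file hdfs://ns2/user/b/file
{code}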

 






[jira] [Resolved] (HDFS-17580) Change the default value of dfs.datanode.lock.fair to false due to potential hang

2024-07-15 Thread farmmamba (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

farmmamba resolved HDFS-17580.
--
Resolution: Won't Fix

Changing the default to false alone cannot prevent the datanode hang. 

We need some other methods.

> Change the default value of dfs.datanode.lock.fair to false due to potential 
> hang
> -
>
> Key: HDFS-17580
> URL: https://issues.apache.org/jira/browse/HDFS-17580
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Affects Versions: 3.4.0
>Reporter: farmmamba
>Assignee: farmmamba
>Priority: Major
>  Labels: pull-request-available
>







[jira] [Created] (HDFS-17580) Change the default value of dfs.datanode.lock.fair to false

2024-07-15 Thread farmmamba (Jira)
farmmamba created HDFS-17580:


 Summary: Change the default value of dfs.datanode.lock.fair to 
false
 Key: HDFS-17580
 URL: https://issues.apache.org/jira/browse/HDFS-17580
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: datanode
Affects Versions: 3.4.0
Reporter: farmmamba
Assignee: farmmamba









[jira] [Created] (HDFS-17579) [DatanodeAdminDefaultMonitor] Better Comments Explaining Why Blocks Need Reconstruction May Not Block Decommission/Maintenance

2024-07-12 Thread wuchang (Jira)
wuchang created HDFS-17579:
--

 Summary: [DatanodeAdminDefaultMonitor] Better Comments Explaining 
Why Blocks Need Reconstruction May Not Block Decommission/Maintenance
 Key: HDFS-17579
 URL: https://issues.apache.org/jira/browse/HDFS-17579
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: dfsadmin
Reporter: wuchang









[jira] [Resolved] (HDFS-17566) Got wrong sorted block order when StorageType is considered.

2024-07-11 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He resolved HDFS-17566.

Fix Version/s: 3.5.0
 Hadoop Flags: Reviewed
   Resolution: Fixed

> Got wrong sorted block order when StorageType is considered.
> 
>
> Key: HDFS-17566
> URL: https://issues.apache.org/jira/browse/HDFS-17566
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Chenyu Zheng
>Assignee: Chenyu Zheng
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.5.0
>
>
> I found unit test failures like below:
> ```
> [ERROR] Tests run: 17, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 
> 9.146 s <<< FAILURE! - in 
> org.apache.hadoop.hdfs.server.blockmanagement.TestDatanodeManager
> [ERROR] 
> testGetBlockLocationConsiderStorageType(org.apache.hadoop.hdfs.server.blockmanagement.TestDatanodeManager)
>   Time elapsed: 0.206 s  <<< FAILURE!
> org.junit.ComparisonFailure: expected: but was:
>     at org.junit.Assert.assertEquals(Assert.java:117)
>     at org.junit.Assert.assertEquals(Assert.java:146)
>     at 
> org.apache.hadoop.hdfs.server.blockmanagement.TestDatanodeManager.testGetBlockLocationConsiderStorageType(TestDatanodeManager.java:769)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>     at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:498)
>     at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59)
>     at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>     at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56)
>     at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>     at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
>     at 
> org.junit.runners.BlockJUnit4ClassRunner$1.evaluate(BlockJUnit4ClassRunner.java:100)
>     at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:366)
>     at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:103)
>     at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:63)
>     at org.junit.runners.ParentRunner$4.run(ParentRunner.java:331)
>     at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:79)
>     at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:329)
>     at org.junit.runners.ParentRunner.access$100(ParentRunner.java:66)
>     at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:293)
>     at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
>     at org.junit.runners.ParentRunner.run(ParentRunner.java:413)
>     at 
> org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365)
>     at 
> org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:273)
>     at 
> org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:238)
>     at 
> org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:159)
>     at 
> org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:384)
>     at 
> org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:345)
>     at 
> org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:126)
>     at 
> org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:418)
> ```
>  
> The reason is that the comparator order introduced in HDFS-17098 is wrong!






[jira] [Created] (HDFS-17578) ShellCommandFencer#setConfAsEnvVars should also replace '-' with '_'.

2024-07-10 Thread fuchaohong (Jira)
fuchaohong created HDFS-17578:
-

 Summary: ShellCommandFencer#setConfAsEnvVars should also replace 
'-' with '_'.
 Key: HDFS-17578
 URL: https://issues.apache.org/jira/browse/HDFS-17578
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: fuchaohong


When setting configuration into environment variables, '-' should also be 
replaced with '_', since '-' is not valid in environment variable names.
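
A minimal sketch of the proposed key sanitization (the '.' replacement is the 
existing behavior; adding the '-' replacement is the change suggested here):
{code:java}
String envKey = confKey.replace('.', '_').replace('-', '_');
{code}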






[jira] [Created] (HDFS-17577) Add Support for CreateFlag.NO_LOCAL_WRITE in File Creation to Manage Disk Space and Network Load in Labeled YARN Nodes

2024-07-09 Thread liang yu (Jira)
liang yu created HDFS-17577:
---

 Summary: Add Support for CreateFlag.NO_LOCAL_WRITE in File 
Creation to Manage Disk Space and Network Load in Labeled YARN Nodes
 Key: HDFS-17577
 URL: https://issues.apache.org/jira/browse/HDFS-17577
 Project: Hadoop HDFS
  Issue Type: New Feature
  Components: dfsclient
Reporter: liang yu


{*}Description{*}: I am currently using Apache Flink to write files into 
Hadoop. The Flink application runs on a labeled YARN queue. During operation, 
it has been observed that the local disks on these labeled nodes get filled up 
quickly, and the network load is significantly high. This issue arises because 
Hadoop prioritizes writing files to the local node first, and the number of 
these labeled nodes is quite limited.

 

{*}Problem{*}: The current behavior leads to inefficient disk space utilization 
and high network traffic on these few labeled nodes, which could potentially 
affect the performance and reliability of the application.

 

{*}Implementation{*}: The implementation would involve adding a configuration 
_dfs.client.write.no_local_write_ to apply the {{CreateFlag.NO_LOCAL_WRITE}} 
flag during file creation in Hadoop's file system APIs. This will give 
applications like Flink running on labeled queues the flexibility to opt for 
non-local writes when necessary.
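
For reference, the flag itself already exists in the file system create API; 
the new configuration would set it implicitly. An illustrative call (variable 
names invented):
{code:java}
FSDataOutputStream out = fs.create(path, FsPermission.getFileDefault(),
    EnumSet.of(CreateFlag.CREATE, CreateFlag.NO_LOCAL_WRITE),
    bufferSize, replication, blockSize, null);
{code}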

 






[jira] [Created] (HDFS-17575) SaslDataTransferClient should use SaslParticipant to create messages

2024-07-09 Thread Tsz-wo Sze (Jira)
Tsz-wo Sze created HDFS-17575:
-

 Summary: SaslDataTransferClient should use SaslParticipant to 
create messages
 Key: HDFS-17575
 URL: https://issues.apache.org/jira/browse/HDFS-17575
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: security
Reporter: Tsz-wo Sze
Assignee: Tsz-wo Sze


Currently, a SaslDataTransferClient may send a message without using its 
SaslParticipant as below.  {code}
  sendSaslMessage(out, new byte[0]);
{code}
Instead, it should use its SaslParticipant to create the response.
{code}
  byte[] localResponse = sasl.evaluateChallengeOrResponse(remoteResponse);
  sendSaslMessage(out, localResponse);
{code}







[jira] [Resolved] (HDFS-10535) Rename AsyncDistributedFileSystem

2024-07-09 Thread Tsz-wo Sze (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-10535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsz-wo Sze resolved HDFS-10535.
---
Resolution: Won't Fix

This JIRA became stale.

> Rename AsyncDistributedFileSystem
> -
>
> Key: HDFS-10535
> URL: https://issues.apache.org/jira/browse/HDFS-10535
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: hdfs-client
>Reporter: Tsz-wo Sze
>Assignee: Tsz-wo Sze
>Priority: Major
> Attachments: h10535_20160616.patch
>
>
> Per discussion in HDFS-9924, AsyncDistributedFileSystem is not a good name 
> since we only support nonblocking calls for the moment.






[jira] [Resolved] (HDFS-11948) Ozone: change TestRatisManager to check cluster with data

2024-07-09 Thread Tsz-wo Sze (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-11948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsz-wo Sze resolved HDFS-11948.
---
Resolution: Won't Fix

This JIRA became stale.

> Ozone: change TestRatisManager to check cluster with data
> -
>
> Key: HDFS-11948
> URL: https://issues.apache.org/jira/browse/HDFS-11948
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: ozone
>Reporter: Tsz-wo Sze
>Assignee: Tsz-wo Sze
>Priority: Major
>  Labels: OzonePostMerge
> Attachments: HDFS-11948-HDFS-7240.20170614.patch, 
> HDFS-11948-HDFS-7240.20170731.patch
>
>
> TestRatisManager first creates multiple Ratis clusters.  Then it changes the 
> membership and closes some clusters.  However, it does not test the clusters 
> with data.






[jira] [Resolved] (HDFS-11734) Ozone: provide a way to validate ContainerCommandRequestProto

2024-07-09 Thread Tsz-wo Sze (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-11734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsz-wo Sze resolved HDFS-11734.
---
Resolution: Won't Fix

This JIRA became stale.

> Ozone: provide a way to validate ContainerCommandRequestProto
> -
>
> Key: HDFS-11734
> URL: https://issues.apache.org/jira/browse/HDFS-11734
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: ozone
>Reporter: Tsz-wo Sze
>Assignee: Anu Engineer
>Priority: Critical
>  Labels: OzonePostMerge, tocheck
>
> We need some API to check if a ContainerCommandRequestProto is valid.
> It is useful when the container pipeline is run with Ratis.  Then, the leader 
> could first check if a ContainerCommandRequestProto is valid before the 
> request is propagated to the followers.






[jira] [Resolved] (HDFS-11735) Ozone: In Ratis, leader should validate ContainerCommandRequestProto before propagating it to followers

2024-07-09 Thread Tsz-wo Sze (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-11735?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsz-wo Sze resolved HDFS-11735.
---
Resolution: Won't Fix

This JIRA became stale.

> Ozone: In Ratis, leader should validate ContainerCommandRequestProto before 
> propagating it to followers
> ---
>
> Key: HDFS-11735
> URL: https://issues.apache.org/jira/browse/HDFS-11735
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: ozone
>Reporter: Tsz-wo Sze
>Assignee: Tsz-wo Sze
>Priority: Major
>  Labels: OzonePostMerge, tocheck
> Attachments: HDFS-11735-HDFS-7240.20170501.patch
>
>
> The leader should use the API provided by HDFS-11734 to check if a 
> ContainerCommandRequestProto is valid before propagating it to followers.






[jira] [Created] (HDFS-17574) Make NNThroughputBenchmark support argument blockSize suffix with k, m, g, t, p, e

2024-07-09 Thread wangzhongwei (Jira)
wangzhongwei created HDFS-17574:
---

 Summary: Make NNThroughputBenchmark support  argument blockSize 
suffix with k, m, g, t, p, e
 Key: HDFS-17574
 URL: https://issues.apache.org/jira/browse/HDFS-17574
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: benchmarks, hdfs
Affects Versions: 3.3.6, 3.3.3
Reporter: wangzhongwei
Assignee: wangzhongwei


As of now, we cannot specify data units in the -blockSize argument (like 1m), 
but have to specify plain numbers.
Test command:
hadoop org.apache.hadoop.hdfs.server.namenode.NNThroughputBenchmark -fs 
[hdfs://|hdfs://ctyunns/]x -op create -threads 100 -files 25 
-filesPerDir 100  -blockSize 1m -close 
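
Hadoop already ships a parser for these suffixes, which the benchmark could 
reuse for -blockSize (illustrative):
{code:java}
import org.apache.hadoop.util.StringUtils;

// TraditionalBinaryPrefix.string2long understands k, m, g, t, p, e suffixes.
long blockSize = StringUtils.TraditionalBinaryPrefix.string2long("1m"); // 1048576
{code}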






[jira] [Resolved] (HDFS-16714) Remove okhttp and kotlin dependencies

2024-07-09 Thread Cheng Pan (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Pan resolved HDFS-16714.
--
Resolution: Duplicate

> Remove okhttp and kotlin dependencies
> -
>
> Key: HDFS-16714
> URL: https://issues.apache.org/jira/browse/HDFS-16714
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs-client
>Affects Versions: 3.3.4
>Reporter: Cheng Pan
>Assignee: Cheng Pan
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> hadoop-common already has Apache HttpClient dependencies, so okhttp is 
> unnecessary






[jira] [Created] (HDFS-17573) Add test code for FSImage parallelization and compression

2024-07-08 Thread Sungdong Kim (Jira)
Sungdong Kim created HDFS-17573:
---

 Summary: Add test code for FSImage parallelization and compression
 Key: HDFS-17573
 URL: https://issues.apache.org/jira/browse/HDFS-17573
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: hdfs, namenode
Affects Versions: 3.4.1
Reporter: Sungdong Kim
 Fix For: 3.4.1


The feature added in HDFS-14617 ("Improve FSImage load time by writing 
sub-sections to the FSImage index", by [Stephen 
O'Donnell|https://issues.apache.org/jira/secure/ViewProfile.jspa?name=sodonnell])
 makes loading the FSImage much faster.

 

But this option cannot be activated when dfs.image.compress=true is turned on.

In my opinion, larger clusters require both settings at the same time.

For example, the cluster I'm using has approximately 6 million file system 
objects, and the FSImage is approximately 11GB with the dfs.image.compress=true 
setting.

If the dfs.image.compress option is turned off, the FSImage is expected to 
exceed 30GB, in which case it takes a long time to move the FSImage from the 
standby to the active namenode, consuming significant network resources.

 

It was shown in HDFS-16147 (by 
[kinit|https://issues.apache.org/jira/secure/ViewProfile.jspa?name=mofei]) that 
parallel FSImage loading and FSImage compression can be turned on at the same 
time.  (And it worked well in my environment too.)
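
For reference, the combination being enabled (these are the existing config 
keys; the snippet is illustrative and not from the patch):
{code:java}
import org.apache.hadoop.conf.Configuration;

class ParallelCompressedImage {
  static Configuration conf() {
    Configuration conf = new Configuration();
    conf.setBoolean("dfs.image.parallel.load", true); // HDFS-14617 feature
    conf.setBoolean("dfs.image.compress", true);      // FSImage compression
    conf.set("dfs.image.compression.codec",
        "org.apache.hadoop.io.compress.GzipCodec");
    return conf;
  }
}
{code}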


I created this new Jira and PR because the discussion in HDFS-16147 ended in 
2021, and I want this to be officially added in the next release instead of 
remaining in Patch Available.

The actual code of the patch was written by 
[kinit|https://issues.apache.org/jira/secure/ViewProfile.jspa?name=mofei]; I 
resolved the empty sub-section problem (see the comment below in HDFS-16147) 
and added test code.


If this is not a proper method, please let me know another way to contribute.

Thanks.






[jira] [Resolved] (HDFS-17571) TestDatanodeManager#testGetBlockLocationConsiderStorageType is flaky

2024-07-08 Thread Ayush Saxena (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ayush Saxena resolved HDFS-17571.
-
Resolution: Duplicate

> TestDatanodeManager#testGetBlockLocationConsiderStorageType is flaky
> 
>
> Key: HDFS-17571
> URL: https://issues.apache.org/jira/browse/HDFS-17571
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Ayush Saxena
>Priority: Major
>
> {noformat}
> org.junit.ComparisonFailure: expected: but was:
>   at org.junit.Assert.assertEquals(Assert.java:117)
>   at org.junit.Assert.assertEquals(Assert.java:146)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.TestDatanodeManager.testGetBlockLocationConsiderStorageType(TestDatanodeManager.java:769)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> {noformat}
> Ref: 
> https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6906/2/testReport/org.apache.hadoop.hdfs.server.blockmanagement/TestDatanodeManager/testGetBlockLocationConsiderStorageType/
> https://ci-hadoop.apache.org/view/Hadoop/job/hadoop-qbt-trunk-java8-linux-x86_64/1628/testReport/junit/org.apache.hadoop.hdfs.server.blockmanagement/TestDatanodeManager/testGetBlockLocationConsiderStorageType/



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-17557) Fix bug for TestRedundancyMonitor#testChooseTargetWhenAllDataNodesStop

2024-07-06 Thread Ayush Saxena (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ayush Saxena resolved HDFS-17557.
-
Fix Version/s: 3.5.0
 Hadoop Flags: Reviewed
   Resolution: Fixed

> Fix bug for TestRedundancyMonitor#testChooseTargetWhenAllDataNodesStop
> --
>
> Key: HDFS-17557
> URL: https://issues.apache.org/jira/browse/HDFS-17557
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Haiyang Hu
>Assignee: Haiyang Hu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.5.0
>
>
> Due to the modification in HDFS-16456, the current UT has not been able to 
> run successfully, so we need to fix it.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-17564) EC: Fix the issue of inaccurate metrics when decommission mark busy DN

2024-07-05 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He resolved HDFS-17564.

Fix Version/s: 3.5.0
 Hadoop Flags: Reviewed
   Resolution: Fixed

> EC: Fix the issue of inaccurate metrics when decommission mark busy DN
> --
>
> Key: HDFS-17564
> URL: https://issues.apache.org/jira/browse/HDFS-17564
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Haiyang Hu
>Assignee: Haiyang Hu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.5.0
>
>
> If a DataNode is marked as busy and contains many EC blocks, then when 
> decommissioning that DataNode and executing ErasureCodingWork#addTaskToDatanode, 
> no replication work will be generated for ecBlocksToBeReplicated, but the 
> related metrics (such as DatanodeDescriptor#currApproxBlocksScheduled, 
> pendingReconstruction and needReconstruction) will still be updated.
> *Specific code:*
> BlockManager#scheduleReconstruction -> BlockManager#chooseSourceDatanodes 
> [2628~2650] 
> If a DataNode is marked as busy and contains many EC blocks, it will not be 
> added to srcNodes here.
> {code:java}
> @VisibleForTesting
> DatanodeDescriptor[] chooseSourceDatanodes(BlockInfo block,
>     List<DatanodeDescriptor> containingNodes,
>     List<DatanodeStorageInfo> nodesContainingLiveReplicas,
>     NumberReplicas numReplicas, List<Byte> liveBlockIndices,
>     List<Byte> liveBusyBlockIndices, List<Byte> excludeReconstructed,
>     int priority) {
>   containingNodes.clear();
>   nodesContainingLiveReplicas.clear();
>   List<DatanodeDescriptor> srcNodes = new ArrayList<>();
>  ...
>   for (DatanodeStorageInfo storage : blocksMap.getStorages(block)) {
> final DatanodeDescriptor node = getDatanodeDescriptorFromStorage(storage);
> final StoredReplicaState state = checkReplicaOnStorage(numReplicas, block,
> storage, corruptReplicas.getNodes(block), false);
> ...
> // for EC here need to make sure the numReplicas replicates state correct
> // because in the scheduleReconstruction it need the numReplicas to check
> // whether need to reconstruct the ec internal block
> byte blockIndex = -1;
> if (isStriped) {
>   blockIndex = ((BlockInfoStriped) block)
>   .getStorageBlockIndex(storage);
>   countLiveAndDecommissioningReplicas(numReplicas, state,
>   liveBitSet, decommissioningBitSet, blockIndex);
> }
> if (priority != LowRedundancyBlocks.QUEUE_HIGHEST_PRIORITY
> && (!node.isDecommissionInProgress() && !node.isEnteringMaintenance())
> && node.getNumberOfBlocksToBeReplicated() +
> node.getNumberOfBlocksToBeErasureCoded() >= maxReplicationStreams) {
>   if (isStriped && (state == StoredReplicaState.LIVE
> || state == StoredReplicaState.DECOMMISSIONING)) {
> liveBusyBlockIndices.add(blockIndex);
> //HDFS-16566 ExcludeReconstructed won't be reconstructed.
> excludeReconstructed.add(blockIndex);
>   }
>   continue; // already reached replication limit
> }
> if (node.getNumberOfBlocksToBeReplicated() +
> node.getNumberOfBlocksToBeErasureCoded() >= 
> replicationStreamsHardLimit) {
>   if (isStriped && (state == StoredReplicaState.LIVE
> || state == StoredReplicaState.DECOMMISSIONING)) {
> liveBusyBlockIndices.add(blockIndex);
> //HDFS-16566 ExcludeReconstructed won't be reconstructed.
> excludeReconstructed.add(blockIndex);
>   }
>   continue;
> }
> if(isStriped || srcNodes.isEmpty()) {
>   srcNodes.add(node);
>   if (isStriped) {
> liveBlockIndices.add(blockIndex);
>   }
>   continue;
> }
>...
> {code}
> ErasureCodingWork#addTaskToDatanode[149~157]
> {code:java}
> @Override
> void addTaskToDatanode(NumberReplicas numberReplicas) {
>   final DatanodeStorageInfo[] targets = getTargets();
>   assert targets.length > 0;
>   BlockInfoStriped stripedBlk = (BlockInfoStriped) getBlock();
>   ...
>   } else if ((numberReplicas.decommissioning() > 0 ||
>   numberReplicas.liveEnteringMaintenanceReplicas() > 0) &&
>   hasAllInternalBlocks()) {
> List<DatanodeStorageInfo> leavingServiceSources = findLeavingServiceSources();
> // decommissioningSources.size() should be >= targets.length
> // if the leavingServiceSources size is 0, here it will not
> // createReplicationWork
> final int num = Math.min(leavingServiceSources.size(), targets.length);
> for (int i = 0; i < num; i++) {
> ...
> {code}

[jira] [Created] (HDFS-17572) TestRouterSecurityManager#testDelegationTokens is flaky

2024-07-05 Thread Ayush Saxena (Jira)
Ayush Saxena created HDFS-17572:
---

 Summary: TestRouterSecurityManager#testDelegationTokens is flaky
 Key: HDFS-17572
 URL: https://issues.apache.org/jira/browse/HDFS-17572
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: Ayush Saxena



{noformat}
Expected: (an instance of 
org.apache.hadoop.security.token.SecretManager$InvalidToken and exception with 
message a string containing "Renewal request for unknown token")
 but: exception with message a string containing "Renewal request for 
unknown token" message was "some_renewer tried to renew an expired token (token 
for router: HDFS_DELEGATION_TOKEN owner=router, renewer=some_renewer, 
realUser=, issueDate=1720114742074, maxDate=1720114742174, sequenceNumber=6, 
masterKeyId=37) max expiration date: 2024-07-04 17:39:02,174+ currentTime: 
2024-07-04 17:39:02,233+"
Stacktrace was: org.apache.hadoop.security.token.SecretManager$InvalidToken: 
some_renewer tried to renew an expired token (token for router: 
HDFS_DELEGATION_TOKEN owner=router, renewer=some_renewer, realUser=, 
issueDate=1720114742074, maxDate=1720114742174, sequenceNumber=6, 
masterKeyId=37) max expiration date: 2024-07-04 17:39:02,174+ currentTime: 
2024-07-04 17:39:02,233+
 at 
org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager.renewToken(AbstractDelegationTokenSecretManager.java:692)
 at 
org.apache.hadoop.hdfs.server.federation.router.security.RouterSecurityManager.renewDelegationToken(RouterSecurityManager.java:180)
 at 
org.apache.hadoop.hdfs.server.federation.security.TestRouterSecurityManager.testDelegationTokens(TestRouterSecurityManager.java:140)
{noformat}

Ref:
https://ci-hadoop.apache.org/view/Hadoop/job/hadoop-qbt-trunk-java8-linux-x86_64/1628/testReport/junit/org.apache.hadoop.hdfs.server.federation.security/TestRouterSecurityManager/testDelegationTokens/



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Created] (HDFS-17571) TestDatanodeManager#testGetBlockLocationConsiderStorageType is flaky

2024-07-05 Thread Ayush Saxena (Jira)
Ayush Saxena created HDFS-17571:
---

 Summary: 
TestDatanodeManager#testGetBlockLocationConsiderStorageType is flaky
 Key: HDFS-17571
 URL: https://issues.apache.org/jira/browse/HDFS-17571
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: Ayush Saxena



{noformat}
org.junit.ComparisonFailure: expected: but was:
at org.junit.Assert.assertEquals(Assert.java:117)
at org.junit.Assert.assertEquals(Assert.java:146)
at 
org.apache.hadoop.hdfs.server.blockmanagement.TestDatanodeManager.testGetBlockLocationConsiderStorageType(TestDatanodeManager.java:769)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
{noformat}

Ref: 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6906/2/testReport/org.apache.hadoop.hdfs.server.blockmanagement/TestDatanodeManager/testGetBlockLocationConsiderStorageType/

https://ci-hadoop.apache.org/view/Hadoop/job/hadoop-qbt-trunk-java8-linux-x86_64/1628/testReport/junit/org.apache.hadoop.hdfs.server.blockmanagement/TestDatanodeManager/testGetBlockLocationConsiderStorageType/




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Created] (HDFS-17570) Respect Non-Default HADOOP_ROOT_LOGGER when HADOOP_DAEMON_ROOT_LOGGER is not specified in Daemon mode

2024-07-04 Thread wuchang (Jira)
wuchang created HDFS-17570:
--

 Summary: Respect Non-Default HADOOP_ROOT_LOGGER when 
HADOOP_DAEMON_ROOT_LOGGER is not specified in Daemon mode
 Key: HDFS-17570
 URL: https://issues.apache.org/jira/browse/HDFS-17570
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: wuchang






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Created] (HDFS-17569) Setup Effective Work Number when Generating Block Reconstruction Work

2024-07-04 Thread wuchang (Jira)
wuchang created HDFS-17569:
--

 Summary: Setup Effective Work Number when Generating Block 
Reconstruction Work
 Key: HDFS-17569
 URL: https://issues.apache.org/jira/browse/HDFS-17569
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: wuchang






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Created] (HDFS-17568) [Decommission]Show Info Log for Repeated Useless refreshNode Operation

2024-07-04 Thread wuchang (Jira)
wuchang created HDFS-17568:
--

 Summary: [Decommission]Show Info Log for Repeated Useless 
refreshNode Operation
 Key: HDFS-17568
 URL: https://issues.apache.org/jira/browse/HDFS-17568
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: wuchang


[https://github.com/apache/hadoop/pull/6921]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Created] (HDFS-17567) Return value of method RouterRpcClient#invokeSequential is not accurate

2024-07-03 Thread farmmamba (Jira)
farmmamba created HDFS-17567:


 Summary: Return value of method RouterRpcClient#invokeSequential 
is not accurate
 Key: HDFS-17567
 URL: https://issues.apache.org/jira/browse/HDFS-17567
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: rbf
Affects Versions: 3.4.0
Reporter: farmmamba
Assignee: farmmamba






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Created] (HDFS-17566) Got wrong sorted block order when StorageType is considered.

2024-07-03 Thread Chenyu Zheng (Jira)
Chenyu Zheng created HDFS-17566:
---

 Summary: Got wrong sorted block order when StorageType is 
considered.
 Key: HDFS-17566
 URL: https://issues.apache.org/jira/browse/HDFS-17566
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: Chenyu Zheng
Assignee: Chenyu Zheng


I found unit test failures like below:

```

[ERROR] Tests run: 17, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 9.146 
s <<< FAILURE! - in 
org.apache.hadoop.hdfs.server.blockmanagement.TestDatanodeManager
[ERROR] 
testGetBlockLocationConsiderStorageType(org.apache.hadoop.hdfs.server.blockmanagement.TestDatanodeManager)
  Time elapsed: 0.206 s  <<< FAILURE!
org.junit.ComparisonFailure: expected: but was:
    at org.junit.Assert.assertEquals(Assert.java:117)
    at org.junit.Assert.assertEquals(Assert.java:146)
    at 
org.apache.hadoop.hdfs.server.blockmanagement.TestDatanodeManager.testGetBlockLocationConsiderStorageType(TestDatanodeManager.java:769)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at 
org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59)
    at 
org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
    at 
org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56)
    at 
org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
    at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
    at 
org.junit.runners.BlockJUnit4ClassRunner$1.evaluate(BlockJUnit4ClassRunner.java:100)
    at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:366)
    at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:103)
    at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:63)
    at org.junit.runners.ParentRunner$4.run(ParentRunner.java:331)
    at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:79)
    at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:329)
    at org.junit.runners.ParentRunner.access$100(ParentRunner.java:66)
    at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:293)
    at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
    at org.junit.runners.ParentRunner.run(ParentRunner.java:413)
    at 
org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365)
    at 
org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:273)
    at 
org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:238)
    at 
org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:159)
    at 
org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:384)
    at 
org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:345)
    at 
org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:126)
    at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:418)

```

 

The reason is that the comparator order introduced in HDFS-17098 is wrong!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Created] (HDFS-17565) EC: dfs.datanode.ec.reconstruction.threads should be configurable.

2024-07-02 Thread Chenyu Zheng (Jira)
Chenyu Zheng created HDFS-17565:
---

 Summary: EC: dfs.datanode.ec.reconstruction.threads should be 
configurable.
 Key: HDFS-17565
 URL: https://issues.apache.org/jira/browse/HDFS-17565
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: Chenyu Zheng
Assignee: Chenyu Zheng


dfs.datanode.ec.reconstruction.threads should be configurable, so that we can 
adjust the speed of EC block reconstruction (see the sketch below). This matters 
especially for HDFS-17550, which wants to decommission DataNodes via EC block 
reconstruction.
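
As a rough illustration (the value below is arbitrary; the key and its default 
of 8 come from hdfs-default.xml):

{code:java}
import org.apache.hadoop.conf.Configuration;

// Illustrative only: raise the EC reconstruction thread count so block
// reconstruction (and hence EC-based decommissioning) can go faster.
Configuration conf = new Configuration();
conf.setInt("dfs.datanode.ec.reconstruction.threads", 16); // default is 8
{code}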



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Created] (HDFS-17564) Erasure Coding: Fix the issue of inaccurate metrics when decommission mark busy DN

2024-06-30 Thread Haiyang Hu (Jira)
Haiyang Hu created HDFS-17564:
-

 Summary: Erasure Coding: Fix the issue of inaccurate metrics when 
decommission mark busy DN
 Key: HDFS-17564
 URL: https://issues.apache.org/jira/browse/HDFS-17564
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: Haiyang Hu
Assignee: Haiyang Hu






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Created] (HDFS-17563) IPC's epoch is not the current writer epoch

2024-06-28 Thread Yonghao Zou (Jira)
Yonghao Zou created HDFS-17563:
--

 Summary: IPC's epoch is not the current writer epoch
 Key: HDFS-17563
 URL: https://issues.apache.org/jira/browse/HDFS-17563
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: ipc
Affects Versions: 3.3.4, 3.2.4
Reporter: Yonghao Zou


I got the following errors when running a cluster:

 

 
{code:java}
2024-06-28 03:07:57,334 WARN 
org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Remote journal 
127.0.0.1:8485 failed to write txns 4-5. Will try to write to this JN again 
after the next log roll.
org.apache.hadoop.ipc.RemoteException(java.io.IOException): IPC's epoch 1 is 
not the current writer epoch  0 ; journal id: mycluster
        at 
org.apache.hadoop.hdfs.qjournal.server.Journal.checkWriteRequest(Journal.java:521)
        at 
org.apache.hadoop.hdfs.qjournal.server.Journal.journal(Journal.java:398)
        at 
org.apache.hadoop.hdfs.qjournal.server.JournalNodeRpcServer.journal(JournalNodeRpcServer.java:191)
        at 
org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolServerSideTranslatorPB.journal(QJournalProtocolServerSideTranslatorPB.java:164)
        at 
org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocolProtos$QJournalProtocolService$2.callBlockingMethod(QJournalProtocolProtos.java:28974)
        at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:549)
        at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:518)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1086)
        at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1035)
        at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:963)
        at java.base/java.security.AccessController.doPrivileged(Native Method)
        at java.base/javax.security.auth.Subject.doAs(Subject.java:423)
        at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1762)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2960)        at 
org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1612)
        at org.apache.hadoop.ipc.Client.call(Client.java:1558)
        at org.apache.hadoop.ipc.Client.call(Client.java:1455)
        at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:231)
        at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118)
        at com.sun.proxy.$Proxy14.journal(Unknown Source)
        at 
org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolTranslatorPB.journal(QJournalProtocolTranslatorPB.java:191)
        at 
org.apache.hadoop.hdfs.qjournal.client.IPCLoggerChannel$7.call(IPCLoggerChannel.java:401)
        at 
org.apache.hadoop.hdfs.qjournal.client.IPCLoggerChannel$7.call(IPCLoggerChannel.java:394)
        at 
com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:125)
        at 
com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:57)
        at 
com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:78)
        at 
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at 
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:829){code}
 

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Created] (HDFS-17562) NPE in ipc/Client.java

2024-06-28 Thread Yonghao Zou (Jira)
Yonghao Zou created HDFS-17562:
--

 Summary: NPE in ipc/Client.java
 Key: HDFS-17562
 URL: https://issues.apache.org/jira/browse/HDFS-17562
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: ipc
Affects Versions: 3.3.6, 3.3.4, 3.2.4
Reporter: Yonghao Zou


An NPE happened today that crashed datanodes.

 
{code:java}
2024-06-28 03:07:58,649 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: 
IOException in offerService    java.io.IOException: DestHost:destPort 
localhost:9000 , LocalHost:localPort e07ff098d9e2/172.17.0.4:0. Failed on local 
exception: java.io.IOException: Error reading responses            at 
java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native
 Method)            at 
java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
            at 
java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
            at 
java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:490)       
     at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:842)       
     at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:817)         
   at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1616)            
at org.apache.hadoop.ipc.Client.call(Client.java:1558)            at 
org.apache.hadoop.ipc.Client.call(Client.java:1455)            at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:231)
            at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118)
            at com.sun.proxy.$Proxy19.sendHeartbeat(Unknown Source)            
at 
org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.sendHeartbeat(DatanodeProtocolClientSideTranslatorPB.java:168)
            at 
org.apache.hadoop.hdfs.server.datanode.BPServiceActor.sendHeartBeat(BPServiceActor.java:524)
            at 
org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:658)
            at 
org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:855)
            at java.base/java.lang.Thread.run(Thread.java:829)    Caused by: 
java.io.IOException: Error reading responses            at 
org.apache.hadoop.ipc.Client$Connection.run(Client.java:1141)    Caused by: 
java.lang.NullPointerException            at 
org.apache.hadoop.ipc.Client$Connection.receiveRpcResponse(Client.java:1252)    
        at org.apache.hadoop.ipc.Client$Connection.run(Client.java:1134) {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Created] (HDFS-17561) Make HeartbeatManager.Monitor use up-to-date heartbeatRecheckInterval

2024-06-28 Thread Felix N (Jira)
Felix N created HDFS-17561:
--

 Summary: Make HeartbeatManager.Monitor use up-to-date 
heartbeatRecheckInterval
 Key: HDFS-17561
 URL: https://issues.apache.org/jira/browse/HDFS-17561
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: Felix N
Assignee: Felix N


DatanodeManager can change heartbeatRecheckInterval via the reconfiguration 
API, but HeartbeatManager's copy of heartbeatRecheckInterval is fixed at 
initialization and won't update when DatanodeManager picks up a new config; a 
sketch of the intended behavior follows.
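
A self-contained sketch of the intended fix (names are hypothetical; the point 
is that the monitor polls DatanodeManager for the current value each cycle 
instead of caching it at construction):

{code:java}
import java.util.function.LongSupplier;

// Hypothetical sketch: the monitor reads the (possibly reconfigured) interval
// on every pass, e.g. via datanodeManager::getHeartbeatRecheckInterval.
class RecheckMonitor implements Runnable {
  private final LongSupplier recheckIntervalMs;
  private volatile boolean running = true;

  RecheckMonitor(LongSupplier recheckIntervalMs) {
    this.recheckIntervalMs = recheckIntervalMs;
  }

  @Override
  public void run() {
    while (running) {
      long interval = recheckIntervalMs.getAsLong(); // fresh value each cycle
      // ... expire stale heartbeats using the up-to-date interval ...
      try {
        Thread.sleep(Math.min(interval, 5000L));
      } catch (InterruptedException ie) {
        Thread.currentThread().interrupt();
        return;
      }
    }
  }

  void stop() { running = false; }
}
{code}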



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Created] (HDFS-17560) When CurrentCall does not include StateId, it can still send requests to the Observer.

2024-06-27 Thread fuchaohong (Jira)
fuchaohong created HDFS-17560:
-

 Summary: When CurrentCall does not include StateId, it can still 
send requests to the Observer.
 Key: HDFS-17560
 URL: https://issues.apache.org/jira/browse/HDFS-17560
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: rbf
Reporter: fuchaohong


When the size of the federated state propagated to the client exceeds 
maxSizeOfFederatedStateToPropagate, all requests are forwarded to the active 
NameNode. I don't think this is very reasonable. When Observer reads are enabled 
and no federated state has been propagated to the client, advanceClientStateId 
can use sharedGlobalStateId to assign poolLocalStateId and send the request to 
the Observer.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Created] (HDFS-17559) Fix the uuid as null in NameNodeMXBean

2024-06-27 Thread Haiyang Hu (Jira)
Haiyang Hu created HDFS-17559:
-

 Summary: Fix the uuid as null in NameNodeMXBean
 Key: HDFS-17559
 URL: https://issues.apache.org/jira/browse/HDFS-17559
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: Haiyang Hu
Assignee: Haiyang Hu


If there is datanode info in the includes file, but the datanode service is not 
currently started, the uuid of the datanode will be null. When getting the 
DeadNodes metric, the following exception will occur:
{code:java}
2024-06-26 17:06:49,698 ERROR jmx.JMXJsonServlet 
(JMXJsonServlet.java:writeAttribute(345)) [qtp1107412069-7704] - getting 
attribute DeadNodes of Hadoop:service=NameNode,name=NameNodeInfo threw an 
exception javax.management.RuntimeMBeanException: 
java.lang.NullPointerException: null value in entry: uuid=null
        at 
com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.rethrow(DefaultMBeanServerInterceptor.java:839)
        at 
com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.rethrowMaybeMBeanException(DefaultMBeanServerInterceptor.java:852)
        at 
com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.getAttribute(DefaultMBeanServerInterceptor.java:651)
        at 
com.sun.jmx.mbeanserver.JmxMBeanServer.getAttribute(JmxMBeanServer.java:678)
        at 
org.apache.hadoop.jmx.JMXJsonServlet.writeAttribute(JMXJsonServlet.java:338)
{code}




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Created] (HDFS-17558) RBF: Make maxSizeOfFederatedStateToPropagate work on setResponseHeaderState.

2024-06-26 Thread fuchaohong (Jira)
fuchaohong created HDFS-17558:
-

 Summary: RBF: Make maxSizeOfFederatedStateToPropagate work on 
setResponseHeaderState.
 Key: HDFS-17558
 URL: https://issues.apache.org/jira/browse/HDFS-17558
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: rbf
Reporter: fuchaohong


When the size of namespaceIdMap exceeds 
RBFConfigKeys.DFS_ROUTER_OBSERVER_FEDERATED_STATE_PROPAGATION_MAXSIZE, the 
federated state is not propagated at all. This behavior is inconsistent with the 
configuration description, which states that the size of the federated state 
propagated to the client should be limited.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Created] (HDFS-17557) Fix bug for TestRedundancyMonitor#testChooseTargetWhenAllDataNodesStop

2024-06-22 Thread Haiyang Hu (Jira)
Haiyang Hu created HDFS-17557:
-

 Summary: Fix bug for 
TestRedundancyMonitor#testChooseTargetWhenAllDataNodesStop
 Key: HDFS-17557
 URL: https://issues.apache.org/jira/browse/HDFS-17557
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: Haiyang Hu
Assignee: Haiyang Hu






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Created] (HDFS-17556) timedOutItems should also be checked to avoid repeated adds to neededReconstruction during decommission

2024-06-21 Thread caozhiqiang (Jira)
caozhiqiang created HDFS-17556:
--

 Summary: timedOutItems should also be checked to avoid repeated adds 
to neededReconstruction during decommission
 Key: HDFS-17556
 URL: https://issues.apache.org/jira/browse/HDFS-17556
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: namenode
Affects Versions: 3.5.0
Reporter: caozhiqiang
Assignee: caozhiqiang


In the decommission and maintenance process, before a block is added to 
BlockManager::neededReconstruction, it is checked whether it has already been 
added. The check covers whether the block is in 
BlockManager::neededReconstruction or in 
PendingReconstructionBlocks::pendingReconstructions, as in the code below. 
But it also needs to check whether the block is in 
PendingReconstructionBlocks::timedOutItems. Otherwise, 
DatanodeAdminDefaultMonitor will add the block to 
BlockManager::neededReconstruction repeatedly if the block times out in 
PendingReconstructionBlocks::pendingReconstructions; a sketch of the proposed 
check follows the code.
 
{code:java}
if (!blockManager.neededReconstruction.contains(block) &&
blockManager.pendingReconstruction.getNumReplicas(block) == 0 &&
blockManager.isPopulatingReplQueues()) {
  // Process these blocks only when active NN is out of safe mode.
  blockManager.neededReconstruction.add(block,
  liveReplicas, num.readOnlyReplicas(),
  num.outOfServiceReplicas(),
  blockManager.getExpectedRedundancyNum(block));
} {code}
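
A sketch of the proposed extra condition; PendingReconstructionBlocks does not 
currently expose a contains-style accessor for timedOutItems, so isTimedOutBlock 
below is a hypothetical helper that would need to be added:

{code:java}
// Hypothetical sketch: also skip blocks sitting in timedOutItems, so the
// monitor does not re-add them to neededReconstruction on every scan.
if (!blockManager.neededReconstruction.contains(block) &&
    blockManager.pendingReconstruction.getNumReplicas(block) == 0 &&
    !blockManager.pendingReconstruction.isTimedOutBlock(block) &&
    blockManager.isPopulatingReplQueues()) {
  // Process these blocks only when active NN is out of safe mode.
  blockManager.neededReconstruction.add(block,
      liveReplicas, num.readOnlyReplicas(),
      num.outOfServiceReplicas(),
      blockManager.getExpectedRedundancyNum(block));
}
{code}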



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Created] (HDFS-17555) fix NumberFormatException using letter suffix with conf dfs.blocksize

2024-06-20 Thread wangzhongwei (Jira)
wangzhongwei created HDFS-17555:
---

 Summary: fix NumberFormatException using letter suffix with conf 
dfs.blocksize 
 Key: HDFS-17555
 URL: https://issues.apache.org/jira/browse/HDFS-17555
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: benchmarks
Affects Versions: 3.3.6, 3.3.4, 3.3.3, 3.3.5
Reporter: wangzhongwei


     When using NNThroughputBenchmark, if the configuration item dfs.blocksize in 
hdfs-site.xml is configured with a letter suffix, such as 256m, a 
NumberFormatException occurs.
command:
hadoop org.apache.hadoop.hdfs.server.namenode.NNThroughputBenchmark -fs 
hdfs://ctyunns/x -op create -threads 100 -files 25 -filesPerDir 100 -close
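
For context, org.apache.hadoop.conf.Configuration already has a suffix-aware 
accessor, so a fix along these lines seems plausible (a sketch, not the actual 
patch):

{code:java}
import org.apache.hadoop.conf.Configuration;

public class BlockSizeParseDemo {
  public static void main(String[] args) {
    Configuration conf = new Configuration(false);
    conf.set("dfs.blocksize", "256m");

    // conf.getLong(...) ends up in Long.parseLong("256m") and throws
    // NumberFormatException -- essentially what the benchmark hits.

    // getLongBytes understands the k/m/g/t suffixes:
    long blockSize = conf.getLongBytes("dfs.blocksize", 128L * 1024 * 1024);
    System.out.println(blockSize); // prints 268435456
  }
}
{code}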



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-17554) OIV: Print the storage policy name in OIV delimited output

2024-06-18 Thread Hualong Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hualong Zhang resolved HDFS-17554.
--
Resolution: Not A Problem

> OIV: Print the storage policy name in OIV delimited output
> --
>
> Key: HDFS-17554
> URL: https://issues.apache.org/jira/browse/HDFS-17554
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: tools
>Affects Versions: 3.5.0
>Reporter: Hualong Zhang
>Assignee: Hualong Zhang
>Priority: Major
>
> Refers to adding the storage policy name to the OIV delimited output instead 
> of the erasure coding policy.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-17528) FsImageValidation: set txid when saving a new image

2024-06-18 Thread Tsz-wo Sze (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsz-wo Sze resolved HDFS-17528.
---
Fix Version/s: 3.5.0
 Hadoop Flags: Reviewed
   Resolution: Fixed

The pull request is now merged.

> FsImageValidation: set txid when saving a new image
> ---
>
> Key: HDFS-17528
> URL: https://issues.apache.org/jira/browse/HDFS-17528
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: tools
>Reporter: Tsz-wo Sze
>Assignee: Tsz-wo Sze
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.5.0
>
>
> - When the fsimage is specified as a file and the FsImageValidation tool 
> saves a new image (for removing inaccessible inodes), the txid is not set.  
> Then, the resulted image will have 0 as its txid.
> - When the fsimage is specified as a directory, the txid is set.  However, it 
> will get NPE since NameNode metrics is uninitialized (although the metrics is 
> not used by FsImageValidation).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Created] (HDFS-17554) OIV: Print the storage policy name in OIV delimited output

2024-06-18 Thread Hualong Zhang (Jira)
Hualong Zhang created HDFS-17554:


 Summary: OIV: Print the storage policy name in OIV delimited output
 Key: HDFS-17554
 URL: https://issues.apache.org/jira/browse/HDFS-17554
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: tools
Affects Versions: 3.5.0
Reporter: Hualong Zhang
Assignee: Hualong Zhang


Refers to adding the storage policy name to the OIV delimited output instead of 
the erasure coding policy.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Created] (HDFS-17553) DFSOutputStream.java#closeImpl should have a retry upon flushInternal failures

2024-06-18 Thread Zinan Zhuang (Jira)
Zinan Zhuang created HDFS-17553:
---

 Summary: DFSOutputStream.java#closeImpl should have a retry upon 
flushInternal failures
 Key: HDFS-17553
 URL: https://issues.apache.org/jira/browse/HDFS-17553
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: dfsclient
Affects Versions: 3.4.0, 3.3.1
Reporter: Zinan Zhuang


[HDFS-15865|https://issues.apache.org/jira/browse/HDFS-15865] introduced an 
interrupt in the DataStreamer class to interrupt the waitForAckedSeqno call when 
the timeout has been exceeded. This method is used in 
[DFSOutputStream.java#flushInternal 
|https://github.com/apache/hadoop/blob/branch-3.0.0/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DFSOutputStream.java#L773]
 , one of whose use cases is 
[DFSOutputStream.java#closeImpl|https://github.com/apache/hadoop/blob/branch-3.0.0/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DFSOutputStream.java#L870]
 to close a file. 

What we saw was that we were getting more interrupts during the flushInternal 
call when closing out a file; these were unhandled by the DFSClient and got 
thrown to the caller. There is a known issue, 
[HDFS-4504|https://issues.apache.org/jira/browse/HDFS-4504], where a file that 
fails to close on the HDFS side leaks its lease until the DFSClient is recycled. 
In our HBase setups, DFSClients remain long-lived in each regionserver, which 
means these files remain undead until the regionserver is restarted. 

This issue was observed during datanode decommission, because decommissioning 
was stuck on open files caused by the above leakage. As it is desirable to close 
an HDFS file as smoothly as possible, retrying flushInternal during closeImpl 
would help reduce such leakages; a sketch follows. 
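
A simplified, self-contained sketch of such a retry (names and the retry policy 
are illustrative; in the real client this would wrap 
DFSOutputStream#flushInternal inside closeImpl):

{code:java}
import java.io.IOException;

final class FlushRetry {
  interface Flush { void run() throws IOException; }

  // Retry the final flush a bounded number of times before giving up.
  static void flushWithRetry(Flush flush, int maxAttempts) throws IOException {
    IOException lastFailure = null;
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        flush.run(); // waits for all outstanding packets to be acked
        return;
      } catch (IOException e) {
        lastFailure = e; // e.g. an interrupt from a timed-out ack wait
      }
    }
    throw lastFailure;
  }
}
{code}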
 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-17439) Improve NNThroughputBenchmark to allow non super user to use the tool

2024-06-18 Thread Stephen O'Donnell (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephen O'Donnell resolved HDFS-17439.
--
Fix Version/s: 3.5.0
   Resolution: Fixed

> Improve NNThroughputBenchmark to allow non super user to use the tool
> -
>
> Key: HDFS-17439
> URL: https://issues.apache.org/jira/browse/HDFS-17439
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: benchmarks, namenode
>Reporter: Fateh Singh
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.5.0
>
>
> The NNThroughputBenchmark can only be used by the hdfs user or a user with 
> super user privileges, since entering/exiting safemode is a privileged 
> operation. However, when using a super user, ACL checks are skipped, which 
> renders the tool useless for testing namenode performance together with 
> authorization frameworks such as Apache Ranger or any other authorization 
> framework.
> An optional argument such as -nonSuperUser can be used to skip statements 
> such as entering/exiting safemode. This optional argument makes the tool 
> useful for incorporating authorization frameworks into the performance 
> estimation flows.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Created] (HDFS-17552) [ARR] RPC client uses CompletableFuture to support asynchronous operations.

2024-06-13 Thread Jian Zhang (Jira)
Jian Zhang created HDFS-17552:
-

 Summary: [ARR] RPC client uses CompletableFuture to support 
asynchronous operations.
 Key: HDFS-17552
 URL: https://issues.apache.org/jira/browse/HDFS-17552
 Project: Hadoop HDFS
  Issue Type: Sub-task
Reporter: Jian Zhang


h3. Description

In the implementation of the asynchronous RPC client, the main building blocks 
used include HADOOP-13226, HDFS-10224, etc.

However, the existing implementation does not support `CompletableFuture`; 
instead, it relies on setting up callbacks, which can lead to the "callback 
hell" problem. Using `CompletableFuture` organizes asynchronous callbacks 
better. Therefore, on top of the existing implementation, by using 
`CompletableFuture`, once the `client.call` completes, an asynchronous thread 
handles the response of that call without blocking the main thread.
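
A minimal, self-contained sketch of the idea (independent of the actual patch): 
adapt a callback-style asynchronous call into a CompletableFuture so follow-up 
work composes instead of nesting:

{code:java}
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class AsyncCallDemo {
  interface Callback<T> { void done(T result, Throwable error); }

  // Stands in for an asynchronous client.call that reports via a callback.
  static void asyncCall(String request, Callback<String> cb, ExecutorService ex) {
    ex.execute(() -> cb.done("response-to-" + request, null));
  }

  // Adapter: complete a CompletableFuture from inside the callback.
  static CompletableFuture<String> callAsFuture(String request, ExecutorService ex) {
    CompletableFuture<String> f = new CompletableFuture<>();
    asyncCall(request, (result, error) -> {
      if (error != null) {
        f.completeExceptionally(error);
      } else {
        f.complete(result);
      }
    }, ex);
    return f;
  }

  public static void main(String[] args) {
    ExecutorService ex = Executors.newSingleThreadExecutor();
    callAsFuture("getFileInfo", ex)
        .thenApply(String::toUpperCase)   // composes instead of nesting callbacks
        .thenAccept(System.out::println)
        .join();
    ex.shutdown();
  }
}
{code}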

 

*Test*

new UT: TestAsyncIPC#testAsyncCallWithCompletableFuture()



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-17551) Fix unit test failure caused by HDFS-17464

2024-06-12 Thread Ayush Saxena (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ayush Saxena resolved HDFS-17551.
-
Fix Version/s: 3.5.0
 Hadoop Flags: Reviewed
   Resolution: Fixed

> Fix unit test failure caused by HDFS-17464
> --
>
> Key: HDFS-17551
> URL: https://issues.apache.org/jira/browse/HDFS-17551
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: farmmamba
>Assignee: farmmamba
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 3.5.0
>
>
> As title.
> This Jira is used to fix unit test failure caused by HDFS-17464.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Created] (HDFS-17551) Fix unit test failure caused by HDFS-17464

2024-06-12 Thread farmmamba (Jira)
farmmamba created HDFS-17551:


 Summary: Fix unit test failure caused by HDFS-17464
 Key: HDFS-17551
 URL: https://issues.apache.org/jira/browse/HDFS-17551
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: farmmamba
Assignee: farmmamba


As title.

This Jira is used to fix unit test failure caused by HDFS-17464.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-17539) TestFileChecksum should not spin up a MiniDFSCluster for every test

2024-06-11 Thread ZanderXu (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ZanderXu resolved HDFS-17539.
-
Fix Version/s: 3.5.0
 Hadoop Flags: Reviewed
   Resolution: Fixed

> TestFileChecksum should not spin up a MiniDFSCluster for every test
> ---
>
> Key: HDFS-17539
> URL: https://issues.apache.org/jira/browse/HDFS-17539
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Felix N
>Assignee: Felix N
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 3.5.0
>
>
> TestFileChecksum has 34 tests. Add its sibling, the parameterized 
> COMPOSITE_CRC version, and that's 68 times a cluster is spun up and then shut 
> down, when twice would suffice (or maybe even once, but two is not too bad).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Created] (HDFS-17549) SecretManager should not hardcode HMAC algorithm

2024-06-07 Thread Tsz-wo Sze (Jira)
Tsz-wo Sze created HDFS-17549:
-

 Summary: SecretManager should not hardcode HMAC algorithm
 Key: HDFS-17549
 URL: https://issues.apache.org/jira/browse/HDFS-17549
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: security
Reporter: Tsz-wo Sze
Assignee: Tsz-wo Sze






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Created] (HDFS-17548) excessive NO_REQUIRED_STORAGE_TYPE messages

2024-06-07 Thread Szymon Orzechowski (Jira)
Szymon Orzechowski created HDFS-17548:
-

 Summary: excessive NO_REQUIRED_STORAGE_TYPE messages 
 Key: HDFS-17548
 URL: https://issues.apache.org/jira/browse/HDFS-17548
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: namenode
Affects Versions: 3.3.4
Reporter: Szymon Orzechowski


Notification of unavailable storageType has been implemented in HDFS-15815.

 

Yesterday we noted a failure on our production cluster. As a side result of 
analyzing the reasons for the failure, we found additional error messages:

 

nn-3.wphadoop.dc-2.jumbo._hadoop-hdfs-namenode.log.out:2024-06-07 
00:35:23,381 INFO 
org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy: Not enough 
replicas was chosen. Reason: \{NO_REQUIRED_STORAGE_TYPE=1}

 

These tell us very little and seem to make absolutely no sense in the case of 
our cluster (12 racks, no storage policies enabled nor storage types defined).
However, in 100% of cases they occur directly (or almost directly) after 
messages like:

 

nn-3.wphadoop.dc-2.jumbo._hadoop-hdfs-namenode.log.out-2024-06-07 
00:35:23,380 INFO org.apache.hadoop.ipc.Server: IPC Server handler 25 on 
default port 8020, call#9866 Retry#0 
org.apache.hadoop.hdfs.protocol.ClientProtocol. create from 10.32.20.25:35130: 
org.apache.hadoop.hdfs.protocol.FSLimitException$MaxDirectoryItemsExceededException:
 The directory item limit of /user/gobblin/loghost/failures/dot_ma/undefined is 
exceeded: limit=1048576 items=1048576

 

 

This leads me to the conclusion that in this case the NO_REQUIRED_STORAGE_TYPE 
errors are raised due to reaching the limit specified in the property 
dfs.namenode.fs-limits.max-directory-items. Perhaps they should be suppressed, 
as they provide no information here and actually report a non-existent problem.

 

Additionally, immediately after clearing the 
/user/gobblin/loghost/failures/dot_ma/undefined directory, the 
NO_REQUIRED_STORAGE_TYPE messages stopped appearing.

 

---
I would also like to take this opportunity to ask where to find a list 
specifying the meaning of the values used in the NO_REQUIRED_STORAGE_TYPE=1 
messages (in this case, 1).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Created] (HDFS-17547) debug verifyEC check error

2024-06-07 Thread ruiliang (Jira)
ruiliang created HDFS-17547:
---

 Summary: debug verifyEC check error
 Key: HDFS-17547
 URL: https://issues.apache.org/jira/browse/HDFS-17547
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: hdfs-common
Reporter: ruiliang


When I validate a block that has been corrupted many times, why does it appear 
normal?

 
{code:java}
hdfs  debug verifyEC  -file /file.orc
24/06/07 15:40:29 WARN erasurecode.ErasureCodeNative: ISA-L support is not 
available in your platform... using builtin-java codec where applicable
Checking EC block group: blk_-9223372036492703744
Status: OK
{code}
 

 

The ByteBuffer hb shows all zeros [0...]:
{code:java}
buffers = {ByteBuffer[5]@3270} 
 0 = {HeapByteBuffer@3430} "java.nio.HeapByteBuffer[pos=65536 lim=65536 
cap=65536]"
 1 = {HeapByteBuffer@3434} "java.nio.HeapByteBuffer[pos=65536 lim=65536 
cap=65536]"
 2 = {HeapByteBuffer@3438} "java.nio.HeapByteBuffer[pos=65536 lim=65536 
cap=65536]"
 3 = {HeapByteBuffer@3504} "java.nio.HeapByteBuffer[pos=0 lim=65536 cap=65536]"
  hb = {byte[65536]@3511} [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, +65,436 more]

buffers[this.dataBlkNum + i].equals(outputs[i]) == true ?

outputs = {ByteBuffer[2]@3271} 
 0 = {HeapByteBuffer@3455} "java.nio.HeapByteBuffer[pos=0 lim=65536 cap=65536]"
  hb = {byte[65536]@3459} [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, +65,436 more]{code}
Can this situation be judged as an anomaly?

 

Checking the ORC file:
{code:java}
Structure for skip_ip/_skip_file File Version: 0.12 with ORC_517 by ORC Java  
Exception in thread "main" java.io.IOException: Problem opening stripe 0 footer 
in skip_ip/_skip_file.         at 
org.apache.orc.impl.RecordReaderImpl.(RecordReaderImpl.java:360)         
at org.apache.orc.impl.ReaderImpl.rows(ReaderImpl.java:879)         at 
org.apache.orc.impl.ReaderImpl.rows(ReaderImpl.java:873)         at 
org.apache.orc.tools.FileDump.printMetaDataImpl(FileDump.java:345)         at 
org.apache.orc.tools.FileDump.printMetaData(FileDump.java:276)         at 
org.apache.orc.tools.FileDump.main(FileDump.java:137)         at 
org.apache.orc.tools.Driver.main(Driver.java:124) Caused by: 
java.lang.IllegalArgumentException: Buffer size too small. size = 131072 needed 
= 7752508 in column 3 kind LENGTH         at 
org.apache.orc.impl.InStream$CompressedStream.readHeader(InStream.java:481)     
    at 
org.apache.orc.impl.InStream$CompressedStream.ensureUncompressed(InStream.java:528)
         at 
org.apache.orc.impl.InStream$CompressedStream.read(InStream.java:507)         
at 
org.apache.orc.impl.RunLengthIntegerReaderV2.readValues(RunLengthIntegerReaderV2.java:59)
         at 
org.apache.orc.impl.RunLengthIntegerReaderV2.next(RunLengthIntegerReaderV2.java:333)
         at 
org.apache.orc.impl.TreeReaderFactory$StringDictionaryTreeReader.readDictionaryLengthStream(TreeReaderFactory.java:2221)
         at 
org.apache.orc.impl.TreeReaderFactory$StringDictionaryTreeReader.startStripe(TreeReaderFactory.java:2201)
         at 
org.apache.orc.impl.TreeReaderFactory$StringTreeReader.startStripe(TreeReaderFactory.java:1943)
         at 
org.apache.orc.impl.reader.tree.StructBatchReader.startStripe(StructBatchReader.java:112)
         at 
org.apache.orc.impl.RecordReaderImpl.readStripe(RecordReaderImpl.java:1251)     
    at 
org.apache.orc.impl.RecordReaderImpl.advanceStripe(RecordReaderImpl.java:1290)  
       at 
org.apache.orc.impl.RecordReaderImpl.advanceToNextRow(RecordReaderImpl.java:1333)
         at 
org.apache.orc.impl.RecordReaderImpl.(RecordReaderImpl.java:355)         
... 6 more
 {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Created] (HDFS-17546) Implementing Timeout for HostFileReader when FS hangs

2024-06-06 Thread Simbarashe Dzinamarira (Jira)
Simbarashe Dzinamarira created HDFS-17546:
-

 Summary: Implementing Timeout for HostFileReader when FS hangs
 Key: HDFS-17546
 URL: https://issues.apache.org/jira/browse/HDFS-17546
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: namenode
Reporter: Simbarashe Dzinamarira
Assignee: Simbarashe Dzinamarira


Certain Hadoop deployments have the dfs.hosts file residing on NAS/NFS, 
potentially behind symlinks. If the FS hangs for any reason, the refreshNodes 
call will hang indefinitely in the HostsFileReader until the FS returns; a 
generic sketch of a bounded read follows.
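
One plausible shape for such a timeout (a generic sketch, not the actual 
HostsFileReader change): perform the read on a separate thread and bound the 
wait:

{code:java}
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

public class TimedFileRead {
  // Bound a potentially hanging file read so a stuck NAS/NFS mount cannot
  // block the caller (e.g. refreshNodes) forever.
  public static List<String> readWithTimeout(String path, long timeoutMs)
      throws Exception {
    ExecutorService ex = Executors.newSingleThreadExecutor();
    try {
      Future<List<String>> pending =
          ex.submit(() -> Files.readAllLines(Paths.get(path)));
      return pending.get(timeoutMs, TimeUnit.MILLISECONDS); // TimeoutException on hang
    } finally {
      // best effort; a truly hung NFS read may not be interruptible
      ex.shutdownNow();
    }
  }
}
{code}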



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Created] (HDFS-17545) [ARR] router async rpc client.

2024-06-06 Thread Jian Zhang (Jira)
Jian Zhang created HDFS-17545:
-

 Summary: [ARR] router async rpc client.
 Key: HDFS-17545
 URL: https://issues.apache.org/jira/browse/HDFS-17545
 Project: Hadoop HDFS
  Issue Type: Sub-task
Reporter: Jian Zhang






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Created] (HDFS-17544) [ARR] The router client rpc protocol supports asynchrony.

2024-06-06 Thread Jian Zhang (Jira)
Jian Zhang created HDFS-17544:
-

 Summary: [ARR] The router client rpc protocol supports asynchrony.
 Key: HDFS-17544
 URL: https://issues.apache.org/jira/browse/HDFS-17544
 Project: Hadoop HDFS
  Issue Type: Sub-task
Reporter: Jian Zhang






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Created] (HDFS-17542) EC: Optimize the EC block reconstruction.

2024-06-06 Thread Chenyu Zheng (Jira)
Chenyu Zheng created HDFS-17542:
---

 Summary: EC: Optimize the EC block reconstruction.
 Key: HDFS-17542
 URL: https://issues.apache.org/jira/browse/HDFS-17542
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: Chenyu Zheng
Assignee: Chenyu Zheng


The current reconstruction process for EC blocks is based on that of the 
original contiguous blocks. It is mainly implemented through the work 
constructed by computeReconstructionWorkForBlocks. It can be roughly divided 
into three steps:
 * scheduleReconstruction
 * chooseTargets
 * validateReconstructionWork

For ordinary contiguous blocks:

* (1) scheduleReconstruction

Select srcNodes as the source of the block copy according to the status of each 
replica of the block. 

* (2) chooseTargets

Select the target of the copy.

* (3) validateReconstructionWork

Add the copy command to srcNode; srcNode receives the command through a 
heartbeat and executes the block copy from src to target.

For EC blocks:
(1) and (2) are nearly the same. However, in (3), block copying or block 
reconstruction may occur, or no work may be generated at all, such as when some 
storages are busy. If no work is generated, it leads to the problem 
described in HDFS-17516. Even if no block copy or reconstruction is 
generated, pendingReconstruction and neededReconstruction will still be updated 
until the block times out, which wastes the scheduling opportunity.
In order to stay compatible with the original contiguous blocks while deciding 
the specific action in (3), the unnecessary liveBlockIndices, 
liveBusyBlockIndices, and excludeReconstructedIndices are introduced. We know 
many bugs are related to this code. These can be avoided.

Improvements:
* Move the work of deciding whether to copy or reconstruct blocks from (3) to 
(1), as the illustrative sketch below shows.

Such improvements are more conducive to implementing the explicit specification 
of the reconstruction block index mentioned in HDFS-16874, and remove the need 
to pass liveBlockIndices and liveBusyBlockIndices.
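
Purely as an illustration of that restructuring (names are invented, not from 
the patch): decide the EC action once, up front, and carry it through so later 
stages do not re-derive it from index lists:

{code:java}
// Invented names, for illustration only: an explicit plan object computed in
// scheduleReconstruction; if action == NONE, pendingReconstruction and
// neededReconstruction are never touched, avoiding the timeout waste above.
enum EcAction { COPY, RECONSTRUCT, NONE }

final class EcReconstructionPlan {
  final EcAction action;
  final int[] targetIndices; // internal block indices to produce

  EcReconstructionPlan(EcAction action, int[] targetIndices) {
    this.action = action;
    this.targetIndices = targetIndices;
  }
}
{code}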



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Created] (HDFS-17541) Support msync requests to a separate RPC server for the active NameNode

2024-06-05 Thread Liangjun He (Jira)
Liangjun He created HDFS-17541:
--

 Summary: Support msync requests to a separate RPC server for the 
active NameNode
 Key: HDFS-17541
 URL: https://issues.apache.org/jira/browse/HDFS-17541
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: Liangjun He
Assignee: Liangjun He






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-17533) RBF: Unit tests that use embedded SQL failing in CI

2024-06-03 Thread Simbarashe Dzinamarira (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Simbarashe Dzinamarira resolved HDFS-17533.
---
Resolution: Fixed

> RBF: Unit tests that use embedded SQL failing in CI
> ---
>
> Key: HDFS-17533
> URL: https://issues.apache.org/jira/browse/HDFS-17533
> Project: Hadoop HDFS
>  Issue Type: Test
>Reporter: Simbarashe Dzinamarira
>Assignee: Simbarashe Dzinamarira
>Priority: Major
>
> In the CI runs for RBF the following two tests are failing
> {noformat}
> [ERROR] Failures: 
> [ERROR] 
> org.apache.hadoop.hdfs.server.federation.router.security.token.TestSQLDelegationTokenSecretManagerImpl.null
> [ERROR]   Run 1: TestSQLDelegationTokenSecretManagerImpl Multiple Failures (2 
> failures)
>   java.sql.SQLException: No suitable driver found for 
> jdbc:derby:memory:TokenStore;create=true
>   java.lang.RuntimeException: java.sql.SQLException: No suitable driver 
> found for jdbc:derby:memory:TokenStore;drop=true
> [ERROR]   Run 2: TestSQLDelegationTokenSecretManagerImpl Multiple Failures (2 
> failures)
>   java.sql.SQLException: No suitable driver found for 
> jdbc:derby:memory:TokenStore;create=true
>   java.lang.RuntimeException: java.sql.SQLException: No suitable driver 
> found for jdbc:derby:memory:TokenStore;drop=true
> [ERROR]   Run 3: TestSQLDelegationTokenSecretManagerImpl Multiple Failures (2 
> failures)
>   java.sql.SQLException: No suitable driver found for 
> jdbc:derby:memory:TokenStore;create=true
>   java.lang.RuntimeException: java.sql.SQLException: No suitable driver 
> found for jdbc:derby:memory:TokenStore;drop=true
> [INFO] 
> [ERROR] 
> org.apache.hadoop.hdfs.server.federation.store.driver.TestStateStoreMySQL.null
> [ERROR]   Run 1: TestStateStoreMySQL Multiple Failures (2 failures)
>   java.sql.SQLException: No suitable driver found for 
> jdbc:derby:memory:StateStore;create=true
>   java.lang.RuntimeException: java.sql.SQLException: No suitable driver 
> found for jdbc:derby:memory:StateStore;drop=true
> [ERROR]   Run 2: TestStateStoreMySQL Multiple Failures (2 failures)
>   java.sql.SQLException: No suitable driver found for 
> jdbc:derby:memory:StateStore;create=true
>   java.lang.RuntimeException: java.sql.SQLException: No suitable driver 
> found for jdbc:derby:memory:StateStore;drop=true
> [ERROR]   Run 3: TestStateStoreMySQL Multiple Failures (2 failures)
>   java.sql.SQLException: No suitable driver found for 
> jdbc:derby:memory:StateStore;create=true
>   java.lang.RuntimeException: java.sql.SQLException: No suitable driver 
> found for jdbc:derby:memory:StateStore;drop=true {noformat}
> [https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6804/5/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs-rbf.txt]
>  
> I believe the fix is first registering the driver: 
> [https://dev.mysql.com/doc/connector-j/en/connector-j-usagenotes-connect-drivermanager.html]
> [https://stackoverflow.com/questions/22384710/java-sql-sqlexception-no-suitable-driver-found-for-jdbcmysql-localhost3306]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-17538) Add transfer priority queue for decommissioning datanode

2024-06-03 Thread Yuanbo Liu (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuanbo Liu resolved HDFS-17538.
---
Resolution: Duplicate

> Add transfer priority queue for decommissioning datanode
> ---
>
> Key: HDFS-17538
> URL: https://issues.apache.org/jira/browse/HDFS-17538
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Yuanbo Liu
>Priority: Major
> Attachments: image-2024-05-29-16-24-45-601.png, 
> image-2024-05-29-16-26-58-359.png, image-2024-05-29-16-27-35-886.png
>
>
> When decommissioning a datanode, blocks are checked disk by disk, and then 
> sent to trigger transfer work in the DN. This makes one disk of the 
> decommissioning DN very busy, leaves CPUs stuck in io-wait under high load, 
> and sometimes even leads to OOM as below:
> !image-2024-05-29-16-24-45-601.png|width=909,height=170!
> !image-2024-05-29-16-26-58-359.png|width=909,height=228!
> !image-2024-05-29-16-27-35-886.png|width=930,height=218!
> Proposal: add a priority queue for transferring blocks when decommissioning 
> a datanode.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Created] (HDFS-17540) Namenode retries to warm up EDEK cache forever

2024-05-31 Thread Yu Zhang (Jira)
Yu Zhang created HDFS-17540:
---

 Summary: Namenode retries to warm up EDEK cache forever
 Key: HDFS-17540
 URL: https://issues.apache.org/jira/browse/HDFS-17540
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: encryption, namenode
Affects Versions: 2.8.0
Reporter: Yu Zhang


https://issues.apache.org/jira/browse/HDFS-9405 adds a background thread to 
pre-warm the EDEK cache. 

However, this fails and retries continuously if key retrieval fails for even 
one encryption zone. In our use case, we have temporarily removed the keys for 
certain encryption zones. Currently the namenode and KMS logs are filled with 
errors from the background thread retrying the warmup forever.

The pre-warm thread should (see the sketch below):
 * Continue to refresh other encryption zones even if it fails for one.
 * Retry only if it fails for all encryption zones, which will be the case 
when the KMS is down.
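
As an illustration of that policy only, here is a minimal, hypothetical 
sketch; the EdekCacheWarmer and KeyProviderStub names are invented stand-ins, 
not the actual NameNode or KMS client code:
{code:java}
import java.util.List;

public class EdekCacheWarmer implements Runnable {
  private final List<String> zoneKeyNames;  // one key name per encryption zone
  private final KeyProviderStub provider;   // hypothetical stand-in for the KMS client
  private final long retryIntervalMs;

  public EdekCacheWarmer(List<String> zones, KeyProviderStub provider,
      long retryIntervalMs) {
    this.zoneKeyNames = zones;
    this.provider = provider;
    this.retryIntervalMs = retryIntervalMs;
  }

  @Override
  public void run() {
    while (!Thread.currentThread().isInterrupted()) {
      int failures = 0;
      for (String keyName : zoneKeyNames) {
        try {
          provider.warmUpEncryptedKeys(keyName);  // refresh this zone's EDEKs
        } catch (Exception e) {
          failures++;  // log and continue with the remaining zones
        }
      }
      // Retry only if every zone failed, i.e. the KMS is likely down.
      if (zoneKeyNames.isEmpty() || failures < zoneKeyNames.size()) {
        return;
      }
      try {
        Thread.sleep(retryIntervalMs);
      } catch (InterruptedException ie) {
        Thread.currentThread().interrupt();
      }
    }
  }

  /** Hypothetical stand-in for the KMS key provider. */
  public interface KeyProviderStub {
    void warmUpEncryptedKeys(String keyName) throws Exception;
  }
}
{code}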

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Created] (HDFS-17539) TestFileChecksum should not spin up a MiniDFSCluster for every test

2024-05-30 Thread Felix N (Jira)
Felix N created HDFS-17539:
--

 Summary: TestFileChecksum should not spin up a MiniDFSCluster for 
every test
 Key: HDFS-17539
 URL: https://issues.apache.org/jira/browse/HDFS-17539
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: Felix N
Assignee: Felix N


TestFileChecksum has 34 tests. Add its sibling, the parameterized COMPOSITE_CRC 
version, and a cluster is spun up and shut down 68 times when twice would be 
enough (or maybe even once, but two is not too bad).
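
A minimal sketch of the idea, assuming JUnit 4 and one cluster shared across 
the class; the class name and datanode count are illustrative, not the actual 
test code:
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hdfs.MiniDFSCluster;
import org.junit.AfterClass;
import org.junit.BeforeClass;

public class TestFileChecksumSharedCluster {
  private static MiniDFSCluster cluster;

  @BeforeClass
  public static void setUpCluster() throws Exception {
    // Spin the cluster up once for every test in the class.
    cluster = new MiniDFSCluster.Builder(new Configuration())
        .numDataNodes(3)
        .build();
    cluster.waitActive();
  }

  @AfterClass
  public static void tearDownCluster() {
    // Shut it down once after the last test.
    if (cluster != null) {
      cluster.shutdown();
    }
  }

  // Individual tests then reuse cluster.getFileSystem() rather than
  // creating and destroying their own cluster.
}
{code}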



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Created] (HDFS-17538) Add transfer priority queue for decommissioning datanode

2024-05-29 Thread Yuanbo Liu (Jira)
Yuanbo Liu created HDFS-17538:
-

 Summary: Add transfer priority queue for decommissioning datanode
 Key: HDFS-17538
 URL: https://issues.apache.org/jira/browse/HDFS-17538
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: Yuanbo Liu
 Attachments: image-2024-05-29-16-24-45-601.png, 
image-2024-05-29-16-26-58-359.png, image-2024-05-29-16-27-35-886.png

When decommissioning a datanode, blocks are checked disk by disk, and then sent 
to trigger transfer work in the DN. This makes one disk of the decommissioning 
DN very busy, leaves CPUs stuck in io-wait under high load, and sometimes even 
leads to OOM as below:

!image-2024-05-29-16-24-45-601.png!

!image-2024-05-29-16-26-58-359.png!

!image-2024-05-29-16-27-35-886.png!

Proposal: add a priority queue for transferring blocks when decommissioning a 
datanode. A hypothetical sketch follows.
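
The ticket carries no code, so the following is a purely hypothetical 
illustration of such a queue; the class, fields, and least-loaded-disk 
ordering policy are all invented for the example:
{code:java}
import java.util.Comparator;
import java.util.concurrent.PriorityBlockingQueue;

public class TransferQueueSketch {
  /** A pending block transfer, tagged with its source disk's current load. */
  static class PendingTransfer {
    final long blockId;
    final int diskIndex;
    final int inFlightOnDisk;  // transfers already running on that disk

    PendingTransfer(long blockId, int diskIndex, int inFlightOnDisk) {
      this.blockId = blockId;
      this.diskIndex = diskIndex;
      this.inFlightOnDisk = inFlightOnDisk;
    }
  }

  public static void main(String[] args) {
    // Prefer transfers whose source disk has the least work in flight,
    // so no single disk gets saturated during decommissioning.
    PriorityBlockingQueue<PendingTransfer> queue = new PriorityBlockingQueue<>(
        64, Comparator.comparingInt((PendingTransfer t) -> t.inFlightOnDisk));

    queue.add(new PendingTransfer(1L, 0, 5));  // busy disk
    queue.add(new PendingTransfer(2L, 1, 0));  // idle disk is served first
    System.out.println("next: block " + queue.poll().blockId);
  }
}
{code}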



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-17532) RBF: Allow router state store cache update to overwrite and delete in parallel

2024-05-27 Thread ZanderXu (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ZanderXu resolved HDFS-17532.
-
Hadoop Flags: Reviewed
  Resolution: Fixed

> RBF: Allow router state store cache update to overwrite and delete in parallel
> --
>
> Key: HDFS-17532
> URL: https://issues.apache.org/jira/browse/HDFS-17532
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs, rbf
>Reporter: Felix N
>Assignee: Felix N
>Priority: Minor
>  Labels: pull-request-available
>
> The current implementation of the router state store update is quite 
> inefficient, so much so that when routers were removed and a lot of 
> NameNodeMembership records were deleted in a short burst, the deletions 
> triggered a router safemode in our cluster and caused a lot of trouble.
> This ticket aims to allow the overwrite part and delete part of 
> org.apache.hadoop.hdfs.server.federation.store.CachedRecordStore#overrideExpiredRecords
>  to run in parallel.
> See HDFS-17529 for the other half of this improvement.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Created] (HDFS-17537) RBF: Last block report is incorrect in federationhealth.html

2024-05-27 Thread Ananya Singh (Jira)
Ananya Singh created HDFS-17537:
---

 Summary: RBF: Last block report is incorrect in 
federationhealth.html
 Key: HDFS-17537
 URL: https://issues.apache.org/jira/browse/HDFS-17537
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: hdfs, rbf
Affects Versions: 3.3.6
Reporter: Ananya Singh
Assignee: Ananya Singh






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Created] (HDFS-17536) RBF: Format safe-mode related logic and fix a race

2024-05-24 Thread ZanderXu (Jira)
ZanderXu created HDFS-17536:
---

 Summary: RBF: Format safe-mode related logic and fix a race 
 Key: HDFS-17536
 URL: https://issues.apache.org/jira/browse/HDFS-17536
 Project: Hadoop HDFS
  Issue Type: Task
Reporter: ZanderXu
Assignee: ZanderXu


RBF: Format safe-mode related logic and fix a race.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Created] (HDFS-17535) I have confirmed the EC file is corrupt; can this corrupt file be restored?

2024-05-24 Thread ruiliang (Jira)
ruiliang created HDFS-17535:
---

 Summary: I have confirmed the EC file is corrupt; can this corrupt 
file be restored?
 Key: HDFS-17535
 URL: https://issues.apache.org/jira/browse/HDFS-17535
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: ec, hdfs
Affects Versions: 3.1.0
Reporter: ruiliang


I learned that EC does have a major bug that corrupts files:
https://issues.apache.org/jira/browse/HDFS-15759


1: I have confirmed the EC file is corrupt; can this corrupt file be restored? 
It holds important data, and this is causing production data loss for us. Is 
there a way to recover the following?
corrupt;/file;corrupt block groups \{blk_-xx} zeroParityBlockGroups 
\{blk_-xx[blk_-xx]}

2: Regarding https://github.com/apache/orc/issues/1939, I was wondering: if I 
cherry-pick your current code (GitHub pull request #2869), can I skip the 
patches related to HDFS-14768, HDFS-15186, and HDFS-15240?


HDFS version 3.1.0

Thank you.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-17529) RBF: Improve router state store cache entry deletion

2024-05-23 Thread ZanderXu (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ZanderXu resolved HDFS-17529.
-
Fix Version/s: 3.5.0
 Hadoop Flags: Reviewed
   Resolution: Fixed

> RBF: Improve router state store cache entry deletion
> 
>
> Key: HDFS-17529
> URL: https://issues.apache.org/jira/browse/HDFS-17529
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs, rbf
>Reporter: Felix N
>Assignee: Felix N
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.5.0
>
>
> The current implementation of the router state store update is quite 
> inefficient, so much so that when routers were removed and a lot of 
> NameNodeMembership records were deleted in a short burst, the deletions 
> triggered a router safemode in our cluster and caused a lot of trouble.
> This ticket aims to improve the deletion process for ZK state store 
> implementation.
> See HDFS-17532 for the other half of this improvement



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-17530) Asynchronous router

2024-05-21 Thread Jira


 [ 
https://issues.apache.org/jira/browse/HDFS-17530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Íñigo Goiri resolved HDFS-17530.

Resolution: Duplicate

> Asynchronous router
> --
>
> Key: HDFS-17530
> URL: https://issues.apache.org/jira/browse/HDFS-17530
> Project: Hadoop HDFS
>  Issue Type: New Feature
>Reporter: Jian Zhang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-17464) Improve some logs output in class FsDatasetImpl

2024-05-20 Thread Haiyang Hu (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haiyang Hu resolved HDFS-17464.
---
Fix Version/s: 3.5.0
   Resolution: Resolved

> Improve some logs output in class FsDatasetImpl
> ---
>
> Key: HDFS-17464
> URL: https://issues.apache.org/jira/browse/HDFS-17464
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Affects Versions: 3.4.0
>Reporter: farmmamba
>Assignee: farmmamba
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 3.5.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Created] (HDFS-17533) RBF: Tests that use embedded SQL failing in CI

2024-05-20 Thread Simbarashe Dzinamarira (Jira)
Simbarashe Dzinamarira created HDFS-17533:
-

 Summary: RBF: Tests that use embedded SQL failing in CI
 Key: HDFS-17533
 URL: https://issues.apache.org/jira/browse/HDFS-17533
 Project: Hadoop HDFS
  Issue Type: Test
Reporter: Simbarashe Dzinamarira


In the CI runs for RBF, the following two tests are failing:
{noformat}
[ERROR] Failures: 
[ERROR] 
org.apache.hadoop.hdfs.server.federation.router.security.token.TestSQLDelegationTokenSecretManagerImpl.null
[ERROR]   Run 1: TestSQLDelegationTokenSecretManagerImpl Multiple Failures (2 
failures)
java.sql.SQLException: No suitable driver found for 
jdbc:derby:memory:TokenStore;create=true
java.lang.RuntimeException: java.sql.SQLException: No suitable driver 
found for jdbc:derby:memory:TokenStore;drop=true
[ERROR]   Run 2: TestSQLDelegationTokenSecretManagerImpl Multiple Failures (2 
failures)
java.sql.SQLException: No suitable driver found for 
jdbc:derby:memory:TokenStore;create=true
java.lang.RuntimeException: java.sql.SQLException: No suitable driver 
found for jdbc:derby:memory:TokenStore;drop=true
[ERROR]   Run 3: TestSQLDelegationTokenSecretManagerImpl Multiple Failures (2 
failures)
java.sql.SQLException: No suitable driver found for 
jdbc:derby:memory:TokenStore;create=true
java.lang.RuntimeException: java.sql.SQLException: No suitable driver 
found for jdbc:derby:memory:TokenStore;drop=true
[INFO] 
[ERROR] 
org.apache.hadoop.hdfs.server.federation.store.driver.TestStateStoreMySQL.null
[ERROR]   Run 1: TestStateStoreMySQL Multiple Failures (2 failures)
java.sql.SQLException: No suitable driver found for 
jdbc:derby:memory:StateStore;create=true
java.lang.RuntimeException: java.sql.SQLException: No suitable driver 
found for jdbc:derby:memory:StateStore;drop=true
[ERROR]   Run 2: TestStateStoreMySQL Multiple Failures (2 failures)
java.sql.SQLException: No suitable driver found for 
jdbc:derby:memory:StateStore;create=true
java.lang.RuntimeException: java.sql.SQLException: No suitable driver 
found for jdbc:derby:memory:StateStore;drop=true
[ERROR]   Run 3: TestStateStoreMySQL Multiple Failures (2 failures)
java.sql.SQLException: No suitable driver found for 
jdbc:derby:memory:StateStore;create=true
java.lang.RuntimeException: java.sql.SQLException: No suitable driver 
found for jdbc:derby:memory:StateStore;drop=true {noformat}
[https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6804/5/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs-rbf.txt]

 

I believe the fix is first registering the driver: 
[https://dev.mysql.com/doc/connector-j/en/connector-j-usagenotes-connect-drivermanager.html]

 

[https://stackoverflow.com/questions/22384710/java-sql-sqlexception-no-suitable-driver-found-for-jdbcmysql-localhost3306]
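
For illustration, a minimal sketch of that suggestion against Derby's embedded 
in-memory driver (the class and method names here are hypothetical, not the 
committed fix):
{code:java}
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

public class DerbyDriverRegistration {
  public static void registerDriver() throws SQLException {
    // Register the embedded Derby driver explicitly so that DriverManager
    // can resolve jdbc:derby: URLs even when automatic driver discovery
    // does not run (the likely cause of "No suitable driver found").
    DriverManager.registerDriver(new org.apache.derby.jdbc.EmbeddedDriver());
  }

  public static void main(String[] args) throws SQLException {
    registerDriver();
    try (Connection conn = DriverManager.getConnection(
        "jdbc:derby:memory:TokenStore;create=true")) {
      System.out.println("Connected to " + conn.getMetaData().getURL());
    }
  }
}
{code}
In a test, the registration would presumably run in a setup hook before the 
first connection is opened.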

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Created] (HDFS-17532) Allow router state store cache update to overwrite and delete in parallel

2024-05-20 Thread Felix N (Jira)
Felix N created HDFS-17532:
--

 Summary: Allow router state store cache update to overwrite and 
delete in parallel
 Key: HDFS-17532
 URL: https://issues.apache.org/jira/browse/HDFS-17532
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: hdfs, rbf
Reporter: Felix N
Assignee: Felix N


The current implementation of the router state store update is quite 
inefficient, so much so that when routers were removed and a lot of 
NameNodeMembership records were deleted in a short burst, the deletions 
triggered a router safemode in our cluster and caused a lot of trouble.

This ticket aims to allow the overwrite part and delete part of 
org.apache.hadoop.hdfs.server.federation.store.CachedRecordStore#overrideExpiredRecords
 to run in parallel.

See HDFS-17529 for the other half of this improvement.
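
As a rough sketch of the concurrency pattern only (the real change lives in 
CachedRecordStore; the print statements below are placeholders for the 
overwrite and delete batches):
{code:java}
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ParallelRecordUpdate {
  public static void main(String[] args) {
    ExecutorService pool = Executors.newFixedThreadPool(2);

    // Run the overwrite half and the delete half concurrently instead of
    // sequentially, so a large burst of deletions cannot stall overwrites.
    CompletableFuture<Void> overwrite = CompletableFuture.runAsync(
        () -> System.out.println("overwriting expired records..."), pool);
    CompletableFuture<Void> delete = CompletableFuture.runAsync(
        () -> System.out.println("deleting expired records..."), pool);

    // Wait for both halves before declaring the cache update finished.
    CompletableFuture.allOf(overwrite, delete).join();
    pool.shutdown();
  }
}
{code}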



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Created] (HDFS-17531) RBF: Asynchronous router RPC.

2024-05-19 Thread Jian Zhang (Jira)
Jian Zhang created HDFS-17531:
-

 Summary: RBF: Asynchronous router RPC.
 Key: HDFS-17531
 URL: https://issues.apache.org/jira/browse/HDFS-17531
 Project: Hadoop HDFS
  Issue Type: New Feature
Reporter: Jian Zhang






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Created] (HDFS-17530) Asynchronous router

2024-05-19 Thread Jian Zhang (Jira)
Jian Zhang created HDFS-17530:
-

 Summary: Asynchronous router
 Key: HDFS-17530
 URL: https://issues.apache.org/jira/browse/HDFS-17530
 Project: Hadoop HDFS
  Issue Type: New Feature
Reporter: Jian Zhang






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Created] (HDFS-17529) Improve router state store cache update

2024-05-17 Thread Felix N (Jira)
Felix N created HDFS-17529:
--

 Summary: Improve router state store cache update
 Key: HDFS-17529
 URL: https://issues.apache.org/jira/browse/HDFS-17529
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: hdfs, rbf
Reporter: Felix N
Assignee: Felix N


The current implementation of the router state store update is quite 
inefficient, so much so that when routers were removed and a lot of 
NameNodeMembership records were deleted in a short burst, the deletions 
triggered a router safemode in our cluster and caused a lot of trouble.

This ticket contains two parts: improving the deletion process for the ZK 
state store implementation, and allowing the overwrite part and delete part of 
CachedRecordStore#overrideExpiredRecords to run in parallel. A hypothetical 
sketch of the first part follows.
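
A hypothetical sketch of the first part, using batched asynchronous deletes 
with the plain ZooKeeper client; the helper class is invented for illustration 
and the actual state store implementation may differ:
{code:java}
import java.util.List;
import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.AsyncCallback;
import org.apache.zookeeper.ZooKeeper;

public class AsyncZnodeDeleter {
  /** Issues all deletes without waiting on each round trip individually. */
  public static void deleteAll(ZooKeeper zk, List<String> paths)
      throws InterruptedException {
    CountDownLatch done = new CountDownLatch(paths.size());
    AsyncCallback.VoidCallback callback = (rc, path, ctx) -> done.countDown();
    for (String path : paths) {
      zk.delete(path, -1, callback, null);  // -1 matches any znode version
    }
    done.await();  // block until every delete has been acknowledged
  }
}
{code}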



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-17509) RBF: Fix ClientProtocol.concat throwing an NPE if the target is an empty file.

2024-05-16 Thread ZanderXu (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ZanderXu resolved HDFS-17509.
-
Fix Version/s: 3.5.0
   Resolution: Fixed

> RBF: Fix ClientProtocol.concat throwing an NPE if the target is an empty file.
> --
>
> Key: HDFS-17509
> URL: https://issues.apache.org/jira/browse/HDFS-17509
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: liuguanghua
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 3.5.0
>
>
> hdfs dfs -concat  /tmp/merge /tmp/t1 /tmp/t2
> When /tmp/merge is an empty file, this command throws an NPE via the DFSRouter. 
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-17520) TestDFSAdmin.testAllDatanodesReconfig and TestDFSAdmin.testDecommissionDataNodesReconfig failed

2024-05-14 Thread Shilun Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shilun Fan resolved HDFS-17520.
---
   Fix Version/s: 3.4.1
  3.5.0
Hadoop Flags: Reviewed
Target Version/s: 3.4.1, 3.5.0
  Resolution: Fixed

> TestDFSAdmin.testAllDatanodesReconfig and 
> TestDFSAdmin.testDecommissionDataNodesReconfig failed
> ---
>
> Key: HDFS-17520
> URL: https://issues.apache.org/jira/browse/HDFS-17520
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: ZanderXu
>Assignee: ZanderXu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.1, 3.5.0
>
>
> {code:java}
> [ERROR] Tests run: 21, Failures: 3, Errors: 0, Skipped: 0, Time elapsed: 
> 44.521 s <<< FAILURE! - in org.apache.hadoop.hdfs.tools.TestDFSAdmin
> [ERROR] testAllDatanodesReconfig(org.apache.hadoop.hdfs.tools.TestDFSAdmin)  
> Time elapsed: 2.086 s  <<< FAILURE!
> java.lang.AssertionError: 
> Expecting:
>  <["Reconfiguring status for node [127.0.0.1:43731]: SUCCESS: Changed 
> property dfs.datanode.peer.stats.enabled",
> " From: "false"",
> " To: "true"",
> "started at Fri May 10 13:02:51 UTC 2024 and finished at Fri May 10 
> 13:02:51 UTC 2024."]>
> to contain subsequence:
>  <["SUCCESS: Changed property dfs.datanode.peer.stats.enabled",
> " From: "false"",
> " To: "true""]>
>   at 
> org.apache.hadoop.hdfs.tools.TestDFSAdmin.testAllDatanodesReconfig(TestDFSAdmin.java:1286)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59)
>   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>   at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56)
>   at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>   at 
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
>   at 
> org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
>   at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner$1.evaluate(BlockJUnit4ClassRunner.java:100)
>   at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:366)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:103)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:63)
>   at org.junit.runners.ParentRunner$4.run(ParentRunner.java:331)
>   at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:79)
>   at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:329)
>   at org.junit.runners.ParentRunner.access$100(ParentRunner.java:66)
>   at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:293)
>   at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
>   at org.junit.runners.ParentRunner.run(ParentRunner.java:413)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:273)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:238)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:159)
>   at 
> org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:384)
>   at 
> org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:345)
>   at 
> org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:126)
>   at 
> org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:418) 
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Created] (HDFS-17528) FsImageValidation: set txid when saving a new image

2024-05-14 Thread Tsz-wo Sze (Jira)
Tsz-wo Sze created HDFS-17528:
-

 Summary: FsImageValidation: set txid when saving a new image
 Key: HDFS-17528
 URL: https://issues.apache.org/jira/browse/HDFS-17528
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: Tsz-wo Sze






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Created] (HDFS-17527) RBF: Routers should not allow observer reads when namenode stateId context is disabled

2024-05-14 Thread Simbarashe Dzinamarira (Jira)
Simbarashe Dzinamarira created HDFS-17527:
-

 Summary: RBF: Routers should not allow observer reads when 
namenode stateId context is disabled
 Key: HDFS-17527
 URL: https://issues.apache.org/jira/browse/HDFS-17527
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: Simbarashe Dzinamarira






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-17514) RBF: Routers keep using cached stateID even when active NN returns unset header

2024-05-14 Thread Simbarashe Dzinamarira (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Simbarashe Dzinamarira resolved HDFS-17514.
---
Resolution: Fixed

> RBF: Routers keep using cached stateID even when active NN returns unset 
> header
> ---
>
> Key: HDFS-17514
> URL: https://issues.apache.org/jira/browse/HDFS-17514
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: rbf
>Reporter: Simbarashe Dzinamarira
>Assignee: Simbarashe Dzinamarira
>Priority: Minor
>  Labels: pull-request-available
>
> When a namenode that had "dfs.namenode.state.context.enabled" set to true is 
> restarted with the configuration set to false, routers will keep using a 
> previously cached state ID.
> Without RBF
> * Clients that fetched the old stateID could have stale reads even after 
> msyncing.
> * New clients will go to the active.
> With RBF
> * Clients that fetched the old stateID could have stale reads like above.
> * New clients will also fetch the stale stateID and potentially have stale 
> reads.
> New clients that are created after the restart should not fetch the stale 
> state ID.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-17099) Fix Null Pointer Exception when stopping namesystem in HDFS

2024-05-13 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He resolved HDFS-17099.

Fix Version/s: 3.5.0
 Hadoop Flags: Reviewed
   Resolution: Fixed

> Fix Null Pointer Exception when stopping namesystem in HDFS
> ---
>
> Key: HDFS-17099
> URL: https://issues.apache.org/jira/browse/HDFS-17099
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: ConfX
>Assignee: ConfX
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 3.5.0
>
> Attachments: reproduce.sh
>
>
> h2. What happened:
> Got a NullPointerException when stopping the namesystem in HDFS.
> h2. Buggy code:
>  
> {code:java}
>   void stopActiveServices() {
>     ...
>     if (dir != null && getFSImage() != null) {
>       if (getFSImage().editLog != null) {    // <--- Check whether editLog is 
> null
>         getFSImage().editLog.close();
>       }
>       // Update the fsimage with the last txid that we wrote
>       // so that the tailer starts from the right spot.
>       getFSImage().updateLastAppliedTxIdFromWritten(); // <--- BUG: Even if 
> editLog is null, this line will still be executed and cause nullpointer 
> exception
>     }
>     ...
>   }
>
>   public void updateLastAppliedTxIdFromWritten() {
>     this.lastAppliedTxId = editLog.getLastWrittenTxId();  // <--- This will 
> cause a NullPointerException if editLog is null
>   } {code}
> h2. StackTrace:
>  
> {code:java}
> java.lang.NullPointerException
>         at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.updateLastAppliedTxIdFromWritten(FSImage.java:1553)
>         at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.stopActiveServices(FSNamesystem.java:1463)
>         at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.close(FSNamesystem.java:1815)
>         at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.<init>(FSNamesystem.java:1017)
>         at 
> org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.initialize(SecondaryNameNode.java:248)
>         at 
> org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.<init>(SecondaryNameNode.java:194)
>         at 
> org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.<init>(SecondaryNameNode.java:181)
>  {code}
> h2. How to reproduce:
> (1) Set {{dfs.namenode.top.windows.minutes}} to {{37914516,32,0}}; or set 
> {{dfs.namenode.top.window.num.buckets}} to {{244111242}}.
> (2) Run test: 
> {{org.apache.hadoop.hdfs.server.namenode.TestNameNodeHttpServerXFrame#testSecondaryNameNodeXFrame}}
> h2. What's more:
> I'm still investigating how the parameter 
> {{dfs.namenode.top.windows.minutes}} triggered the buggy code.
>  
> For an easy reproduction, run the reproduce.sh in the attachment.
> We are happy to provide a patch if this issue is confirmed.
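
A minimal sketch of the likely shape of the fix, inferred from the report 
above rather than taken from any committed patch: move the txid update inside 
the same null check.
{code:java}
void stopActiveServices() {
  // ...
  if (dir != null && getFSImage() != null) {
    if (getFSImage().editLog != null) {
      getFSImage().editLog.close();
      // Update the fsimage with the last txid that we wrote so that the
      // tailer starts from the right spot; guarded by the same null check
      // so a null editLog can no longer trigger the NPE.
      getFSImage().updateLastAppliedTxIdFromWritten();
    }
  }
  // ...
}
{code}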



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



  1   2   3   4   5   6   7   8   9   10   >