[jira] [Resolved] (HDFS-16690) Automatically format new unformatted JournalNodes using JournalNodeSyncer

2024-07-23 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He resolved HDFS-16690.

Fix Version/s: 3.5.0
 Hadoop Flags: Reviewed
   Resolution: Fixed

> Automatically format new unformatted JournalNodes using JournalNodeSyncer 
> --
>
> Key: HDFS-16690
> URL: https://issues.apache.org/jira/browse/HDFS-16690
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: journal-node
>Affects Versions: 3.4.0, 3.3.5
> Environment: Demonstrated in a Kubernetes environment running Java 11.
>  # Start new cluster, but short 1 JN (minimum quorum, and the missing JN 
> won’t resolve). VERIFY:
>  - NN formats the 2 existing JN and stabilizes.  NOTE: Formatting using just 
> a quorum will be a separate submission
>  - Messages show sync between JN-0 and JN-1, and NN -> JN.
>  # Scale JN stateful set to add missing JN. VERIFY:
>  - New JN starts
>  - All other JN and all NN report IP address change (IP Address resolution).  
> NOTE: require HADOOP-18365 and HDFS-16688
>  - Messages show sync between all JN, and NN -> JN
>  - New JN is formatted at least once (possibly by multiple other JN)
>  - New JN storage directory is formatted only once
>  - New JN joins cluster (lastWriterEpoch is non-zero)
>Reporter: Steve Vaughan
>Assignee: Aswin M Prabhu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.5.0
>
>
> If an unformatted JournalNode is added to an existing JournalNode set, 
> instances of the JournalNodeSyncer are unable to sync to the new node.  When 
> a sync receives a JournalNotFormattedException, we can initiate a format 
> operation, and then retry the synchronization.
> Conceptually this means that the JournalNodes and their data can be managed 
> independently from the rest of the system, as the JournalNodes will 
> incorporate new JournalNode instances.  Once the new JournalNode is 
> formatted, it can participate in shared edits from the NameNodes. 
> I've been testing an update to the InterQJournalProtocol to add a format call 
> like that used by the NameNode.  Current tests include starting an HA cluster 
> from scratch, but with 2 JournalNode instances.  Once the cluster is up, I 
> can add the 3rd JournalNode (which is unformatted), and the other 2 
> JournalNodes will eventually attempt to sync, which results in a format 
> and a subsequent sync.
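> A minimal sketch of that catch-format-retry flow (hedged: the format call on 
> InterQJournalProtocol is the addition this issue proposes, and the helper 
> names here are illustrative, not the committed JournalNodeSyncer code):
> {code:java}
> // On a JournalNotFormattedException from the remote JN, format it with
> // our NamespaceInfo and retry the sync once.
> private void syncWithFormatRetry(InterQJournalProtocol proxy, String jid,
>     NamespaceInfo nsInfo) throws IOException {
>   try {
>     syncJournal(proxy, jid);          // illustrative sync step
>   } catch (JournalNotFormattedException e) {
>     proxy.format(jid, nsInfo);        // the proposed format call
>     syncJournal(proxy, jid);          // retry after formatting
>   }
> }
> {code}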



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-17566) Got wrong sorted block order when StorageType is considered.

2024-07-11 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He resolved HDFS-17566.

Fix Version/s: 3.5.0
 Hadoop Flags: Reviewed
   Resolution: Fixed

> Got wrong sorted block order when StorageType is considered.
> 
>
> Key: HDFS-17566
> URL: https://issues.apache.org/jira/browse/HDFS-17566
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Chenyu Zheng
>Assignee: Chenyu Zheng
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.5.0
>
>
> I found unit test failures like below:
> ```
> [ERROR] Tests run: 17, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 
> 9.146 s <<< FAILURE! - in 
> org.apache.hadoop.hdfs.server.blockmanagement.TestDatanodeManager
> [ERROR] 
> testGetBlockLocationConsiderStorageType(org.apache.hadoop.hdfs.server.blockmanagement.TestDatanodeManager)
>   Time elapsed: 0.206 s  <<< FAILURE!
> org.junit.ComparisonFailure: expected: but was:
>     at org.junit.Assert.assertEquals(Assert.java:117)
>     at org.junit.Assert.assertEquals(Assert.java:146)
>     at 
> org.apache.hadoop.hdfs.server.blockmanagement.TestDatanodeManager.testGetBlockLocationConsiderStorageType(TestDatanodeManager.java:769)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>     at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:498)
>     at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59)
>     at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>     at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56)
>     at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>     at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
>     at 
> org.junit.runners.BlockJUnit4ClassRunner$1.evaluate(BlockJUnit4ClassRunner.java:100)
>     at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:366)
>     at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:103)
>     at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:63)
>     at org.junit.runners.ParentRunner$4.run(ParentRunner.java:331)
>     at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:79)
>     at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:329)
>     at org.junit.runners.ParentRunner.access$100(ParentRunner.java:66)
>     at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:293)
>     at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
>     at org.junit.runners.ParentRunner.run(ParentRunner.java:413)
>     at 
> org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365)
>     at 
> org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:273)
>     at 
> org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:238)
>     at 
> org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:159)
>     at 
> org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:384)
>     at 
> org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:345)
>     at 
> org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:126)
>     at 
> org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:418)
> ```
>  
> The reason is that the comparator ordering introduced in HDFS-17098 is wrong.
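> The general pitfall, as a hedged sketch (the keys are illustrative, not the 
> exact DatanodeManager logic): when chaining comparison keys, the primary key 
> must be compared first, or the secondary key silently dominates the order.
> {code:java}
> import java.util.Comparator;
>
> class SortOrderSketch {
>   record Location(boolean decommissioned, int storageTypeRank) {}
>
>   // Correct: decommissioned replicas sort last (primary key); only among
>   // equals does the storage-type rank decide (secondary key). Comparing
>   // the keys in the wrong order is the kind of mistake this issue fixes.
>   static final Comparator<Location> ORDER = Comparator
>       .comparing(Location::decommissioned)            // false before true
>       .thenComparingInt(Location::storageTypeRank);   // smaller = preferred
> }
> {code}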



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-17564) EC: Fix the issue of inaccurate metrics when decommissioning a DN marked as busy

2024-07-05 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He resolved HDFS-17564.

Fix Version/s: 3.5.0
 Hadoop Flags: Reviewed
   Resolution: Fixed

> EC: Fix the issue of inaccurate metrics when decommissioning a DN marked as busy
> --
>
> Key: HDFS-17564
> URL: https://issues.apache.org/jira/browse/HDFS-17564
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Haiyang Hu
>Assignee: Haiyang Hu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.5.0
>
>
> If a DataNode is marked as busy and contains many EC blocks, then when the 
> DataNode is decommissioned and ErasureCodingWork#addTaskToDatanode executes, 
> no replication work is generated for ecBlocksToBeReplicated, but the related 
> metrics (such as DatanodeDescriptor#currApproxBlocksScheduled, 
> pendingReconstruction and needReconstruction) are still updated.
> *Specific code:*
> BlockManager#scheduleReconstruction -> BlockManager#chooseSourceDatanodes 
> [2628~2650] 
> If a DataNode is marked as busy and contains many EC blocks, it is not added 
> to srcNodes here.
> {code:java}
> @VisibleForTesting
> DatanodeDescriptor[] chooseSourceDatanodes(BlockInfo block,
>     List<DatanodeDescriptor> containingNodes,
>     List<DatanodeStorageInfo> nodesContainingLiveReplicas,
>     NumberReplicas numReplicas, List<Byte> liveBlockIndices,
>     List<Byte> liveBusyBlockIndices, List<Byte> excludeReconstructed,
>     int priority) {
>   containingNodes.clear();
>   nodesContainingLiveReplicas.clear();
>   List<DatanodeDescriptor> srcNodes = new ArrayList<>();
>  ...
>   for (DatanodeStorageInfo storage : blocksMap.getStorages(block)) {
> final DatanodeDescriptor node = getDatanodeDescriptorFromStorage(storage);
> final StoredReplicaState state = checkReplicaOnStorage(numReplicas, block,
> storage, corruptReplicas.getNodes(block), false);
> ...
> // for EC here need to make sure the numReplicas replicates state correct
> // because in the scheduleReconstruction it need the numReplicas to check
> // whether need to reconstruct the ec internal block
> byte blockIndex = -1;
> if (isStriped) {
>   blockIndex = ((BlockInfoStriped) block)
>   .getStorageBlockIndex(storage);
>   countLiveAndDecommissioningReplicas(numReplicas, state,
>   liveBitSet, decommissioningBitSet, blockIndex);
> }
> if (priority != LowRedundancyBlocks.QUEUE_HIGHEST_PRIORITY
> && (!node.isDecommissionInProgress() && !node.isEnteringMaintenance())
> && node.getNumberOfBlocksToBeReplicated() +
> node.getNumberOfBlocksToBeErasureCoded() >= maxReplicationStreams) {
>   if (isStriped && (state == StoredReplicaState.LIVE
> || state == StoredReplicaState.DECOMMISSIONING)) {
> liveBusyBlockIndices.add(blockIndex);
> //HDFS-16566 ExcludeReconstructed won't be reconstructed.
> excludeReconstructed.add(blockIndex);
>   }
>   continue; // already reached replication limit
> }
> if (node.getNumberOfBlocksToBeReplicated() +
> node.getNumberOfBlocksToBeErasureCoded() >= 
> replicationStreamsHardLimit) {
>   if (isStriped && (state == StoredReplicaState.LIVE
> || state == StoredReplicaState.DECOMMISSIONING)) {
> liveBusyBlockIndices.add(blockIndex);
> //HDFS-16566 ExcludeReconstructed won't be reconstructed.
> excludeReconstructed.add(blockIndex);
>   }
>   continue;
> }
> if(isStriped || srcNodes.isEmpty()) {
>   srcNodes.add(node);
>   if (isStriped) {
> liveBlockIndices.add(blockIndex);
>   }
>   continue;
> }
>...
> {code}
> ErasureCodingWork#addTaskToDatanode[149~157]
> {code:java}
> @Override
> void addTaskToDatanode(NumberReplicas numberReplicas) {
>   final DatanodeStorageInfo[] targets = getTargets();
>   assert targets.length > 0;
>   BlockInfoStriped stripedBlk = (BlockInfoStriped) getBlock();
>   ...
>   } else if ((numberReplicas.decommissioning() > 0 ||
>   numberReplicas.liveEnteringMaintenanceReplicas() > 0) &&
>   hasAllInternalBlocks()) {
> List<Integer> leavingServiceSources = findLeavingServiceSources();
> // decommissioningSources.size() should be >= targets.length
> // if leavingServiceSources is empty, no replication work is created
> final int num = Math.min(leavingServiceSources.size(), targets.length);
> for (int i = 0; i < num; i++) {
>   createReplicationWork(leavingServiceSources.get(i), targets[i]);
> }
>   ...
> }
> // Since no busy decommissioning datanode is in srcNodes, the set of
> // srcIndices returned here is empty.
> private List<Integer> findLeavingServiceSources() {
> // Mark the block in normal node.
> BlockInfoStriped block = (BlockInfoStriped)getBlock();
> BitSet bitSet = new 

[jira] [Resolved] (HDFS-17099) Fix Null Pointer Exception when stopping the namesystem in HDFS

2024-05-13 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He resolved HDFS-17099.

Fix Version/s: 3.5.0
 Hadoop Flags: Reviewed
   Resolution: Fixed

> Fix Null Pointer Exception when stopping the namesystem in HDFS
> ---
>
> Key: HDFS-17099
> URL: https://issues.apache.org/jira/browse/HDFS-17099
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: ConfX
>Assignee: ConfX
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 3.5.0
>
> Attachments: reproduce.sh
>
>
> h2. What happened:
> Got a NullPointerException when stopping the namesystem in HDFS.
> h2. Buggy code:
>  
> {code:java}
>   void stopActiveServices() {
>     ...
>     if (dir != null && getFSImage() != null) {
>       if (getFSImage().editLog != null) {  // <--- checks whether editLog is null
>         getFSImage().editLog.close();
>       }
>       // Update the fsimage with the last txid that we wrote
>       // so that the tailer starts from the right spot.
>       // <--- BUG: even if editLog is null, the next line still runs and
>       // triggers the NullPointerException below.
>       getFSImage().updateLastAppliedTxIdFromWritten();
>     }
>     ...
>   }
>
>   public void updateLastAppliedTxIdFromWritten() {
>     // <--- NPE here when editLog is null
>     this.lastAppliedTxId = editLog.getLastWrittenTxId();
>   }
> {code}
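> A minimal guard consistent with the code above (a sketch of the likely fix, 
> not necessarily the committed patch):
> {code:java}
> if (dir != null && getFSImage() != null) {
>   if (getFSImage().editLog != null) {
>     getFSImage().editLog.close();
>     // Update the last applied txid only when an edit log actually exists,
>     // so updateLastAppliedTxIdFromWritten() never dereferences null.
>     getFSImage().updateLastAppliedTxIdFromWritten();
>   }
> }
> {code}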
> h2. StackTrace:
>  
> {code:java}
> java.lang.NullPointerException
>         at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.updateLastAppliedTxIdFromWritten(FSImage.java:1553)
>         at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.stopActiveServices(FSNamesystem.java:1463)
>         at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.close(FSNamesystem.java:1815)
>         at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.<init>(FSNamesystem.java:1017)
>         at 
> org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.initialize(SecondaryNameNode.java:248)
>         at 
> org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.<init>(SecondaryNameNode.java:194)
>         at 
> org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.<init>(SecondaryNameNode.java:181)
>  {code}
> h2. How to reproduce:
> (1) Set {{dfs.namenode.top.windows.minutes}} to {{37914516,32,0}}; or set 
> {{dfs.namenode.top.window.num.buckets}} to {{244111242}}.
> (2) Run test: 
> {{org.apache.hadoop.hdfs.server.namenode.TestNameNodeHttpServerXFrame#testSecondaryNameNodeXFrame}}
> h2. What's more:
> I'm still investigating how the parameter 
> {{dfs.namenode.top.windows.minutes}} triggered the buggy code.
>  
> For an easy reproduction, run the reproduce.sh in the attachment.
> We are happy to provide a patch if this issue is confirmed.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-17522) JournalNode web interfaces lack configs for X-FRAME-OPTIONS protection

2024-05-13 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He resolved HDFS-17522.

Fix Version/s: 3.5.0
 Hadoop Flags: Reviewed
   Resolution: Fixed

> JournalNode web interfaces lack configs for X-FRAME-OPTIONS protection
> --
>
> Key: HDFS-17522
> URL: https://issues.apache.org/jira/browse/HDFS-17522
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: journal-node
>Affects Versions: 3.0.0-alpha1, 3.5.0
>Reporter: wangzhihui
>Assignee: wangzhihui
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.5.0
>
>
> [HDFS-10579|https://issues.apache.org/jira/browse/HDFS-10579] added 
> protection for the NameNode and DataNode, but protection is missing for the 
> JournalNode web interfaces.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-17500) Add missing operation name while authorizing some operations

2024-05-05 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He resolved HDFS-17500.

Fix Version/s: 3.5.0
 Hadoop Flags: Reviewed
   Resolution: Fixed

> Add missing operation name while authorizing some operations
> 
>
> Key: HDFS-17500
> URL: https://issues.apache.org/jira/browse/HDFS-17500
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Reporter: Abhay
>Assignee: Abhay
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.5.0
>
>
> Currently, the operation name is set to null in FSPermissionChecker when 
> authorizing the 'create' and 'completeFile' operations. Setting the 
> operation names correctly may help the authorizer optimize its authorization 
> processing.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-17456) Fix the incorrect dfsused statistics of datanode when appending a file.

2024-04-29 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He resolved HDFS-17456.

Fix Version/s: 3.5.0
 Hadoop Flags: Reviewed
   Resolution: Fixed

> Fix the incorrect dfsused statistics of datanode when appending a file.
> ---
>
> Key: HDFS-17456
> URL: https://issues.apache.org/jira/browse/HDFS-17456
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Affects Versions: 3.3.3
>Reporter: fuchaohong
>Assignee: fuchaohong
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.5.0
>
>
> In our production env, the namenode page showed that the datanode space had 
> been used up, but the actual datanode machine still had a lot of free space. 
> After troubleshooting, we found that the dfsUsed statistics of the datanode 
> are incorrect when appending to a file. The following shows dfsUsed after 
> each append of 100 bytes.
> |*Actual (buggy)*|*Expected*|
> |0|0|
> |100|100|
> |300|200|
> |600|300|
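> A hedged sketch of the accounting error the table implies (simplified; the 
> real counters live in the DataNode's volume usage tracking):
> {code:java}
> class DfsUsedSketch {
>   private long dfsUsed;
>
>   // Correct: on append, usage grows by the delta between the new and the
>   // old block length. The buggy behavior implied by the table adds the
>   // full new length each time, yielding 0, 100, 300, 600 instead of
>   // 0, 100, 200, 300 across three appends of 100 bytes.
>   void onAppend(long oldBlockLength, long newBlockLength) {
>     dfsUsed += newBlockLength - oldBlockLength;
>   }
> }
> {code}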



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-17461) Fix spotbugs in PeerCache#getInternal

2024-04-12 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He resolved HDFS-17461.

Fix Version/s: 3.5.0
 Hadoop Flags: Reviewed
   Resolution: Fixed

> Fix spotbugs in PeerCache#getInternal
> -
>
> Key: HDFS-17461
> URL: https://issues.apache.org/jira/browse/HDFS-17461
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Haiyang Hu
>Assignee: Haiyang Hu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.5.0
>
>
> Fix spotbugs in PeerCache#getInternal 
> Spotbugs warnings:
> https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6710/4/artifact/out/branch-spotbugs-hadoop-hdfs-project_hadoop-hdfs-client-warnings.html



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-17455) Fix Client throw IndexOutOfBoundsException in DFSInputStream#fetchBlockAt

2024-04-11 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He resolved HDFS-17455.

Fix Version/s: 3.5.0
 Hadoop Flags: Reviewed
   Resolution: Fixed

> Fix Client throw IndexOutOfBoundsException in DFSInputStream#fetchBlockAt
> -
>
> Key: HDFS-17455
> URL: https://issues.apache.org/jira/browse/HDFS-17455
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Haiyang Hu
>Assignee: Haiyang Hu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.5.0
>
>
> When the client reads data and connects to the datanode, an invalid datanode 
> access token raises an InvalidBlockTokenException. The subsequent call to 
> the fetchBlockAt method then throws java.lang.IndexOutOfBoundsException, 
> causing the read to fail.
> *Root cause:*
> * The HDFS file contains only one RBW block, with a block data size of 2048KB.
> * The client opens this file and seeks to the offset of 1024KB to read data.
> * DFSInputStream#getBlockReader connects to the datanode; because the 
> datanode access token is invalid at this point, it throws 
> InvalidBlockTokenException, and the call to DFSInputStream#fetchBlockAt then 
> throws java.lang.IndexOutOfBoundsException.
> {code:java}
> private synchronized DatanodeInfo blockSeekTo(long target)
>  throws IOException {
>if (target >= getFileLength()) {
>// the target size is smaller than fileLength (completeBlockSize + 
> lastBlockBeingWrittenLength),
>// here at this time target is 1024 and getFileLength is 2048
>  throw new IOException("Attempted to read past end of file");
>}
>...
>while (true) {
>  ...
>  try {
>blockReader = getBlockReader(targetBlock, offsetIntoBlock,
>targetBlock.getBlockSize() - offsetIntoBlock, targetAddr,
>storageType, chosenNode);
>if(connectFailedOnce) {
>  DFSClient.LOG.info("Successfully connected to " + targetAddr +
> " for " + targetBlock.getBlock());
>}
>return chosenNode;
>  } catch (IOException ex) {
>...
>} else if (refetchToken > 0 && tokenRefetchNeeded(ex, targetAddr)) {
>  refetchToken--;
>  // Here will catch InvalidBlockTokenException.
>  fetchBlockAt(target);
>} else {
>  ...
>}
>  }
>}
>  }
> private LocatedBlock fetchBlockAt(long offset, long length, boolean useCache)
>   throws IOException {
> maybeRegisterBlockRefresh();
> synchronized(infoLock) {
>   // Here the locatedBlocks only contains one locatedBlock, at this time 
> the offset is 1024 and fileLength is 0,
>   // so the targetBlockIdx is -2
>   int targetBlockIdx = locatedBlocks.findBlock(offset);
>   if (targetBlockIdx < 0) { // block is not cached
> targetBlockIdx = LocatedBlocks.getInsertIndex(targetBlockIdx);
> // Here the targetBlockIdx is 1;
> useCache = false;
>   }
>   if (!useCache) { // fetch blocks
> final LocatedBlocks newBlocks = (length == 0)
> ? dfsClient.getLocatedBlocks(src, offset)
> : dfsClient.getLocatedBlocks(src, offset, length);
> if (newBlocks == null || newBlocks.locatedBlockCount() == 0) {
>   throw new EOFException("Could not find target position " + offset);
> }
> // Update the LastLocatedBlock, if offset is for last block.
> if (offset >= locatedBlocks.getFileLength()) {
>   setLocatedBlocksFields(newBlocks, getLastBlockLength(newBlocks));
> } else {
>   locatedBlocks.insertRange(targetBlockIdx,
>   newBlocks.getLocatedBlocks());
> }
>   }
>   // Here the locatedBlocks only contains one locatedBlock, so will throw 
> java.lang.IndexOutOfBoundsException: Index 1 out of bounds for length 1
>   return locatedBlocks.get(targetBlockIdx);
> }
>   }
> {code}
> The client exception:
> {code:java}
> java.lang.IndexOutOfBoundsException: Index 1 out of bounds for length 1
> at 
> java.base/jdk.internal.util.Preconditions.outOfBounds(Preconditions.java:64)
> at 
> java.base/jdk.internal.util.Preconditions.outOfBoundsCheckIndex(Preconditions.java:70)
> at 
> java.base/jdk.internal.util.Preconditions.checkIndex(Preconditions.java:266)
> at java.base/java.util.Objects.checkIndex(Objects.java:359)
> at java.base/java.util.ArrayList.get(ArrayList.java:427)
> at 
> org.apache.hadoop.hdfs.protocol.LocatedBlocks.get(LocatedBlocks.java:87)
> at 
> org.apache.hadoop.hdfs.DFSInputStream.fetchBlockAt(DFSInputStream.java:569)
> at 
> 

[jira] [Resolved] (HDFS-17368) HA: Standby should exit safemode when resources are from low to available

2024-03-26 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He resolved HDFS-17368.

Fix Version/s: 3.5.0
 Hadoop Flags: Reviewed
   Resolution: Fixed

> HA: Standby should exit safemode when resources are from low to available
> -
>
> Key: HDFS-17368
> URL: https://issues.apache.org/jira/browse/HDFS-17368
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Zilong Zhu
>Assignee: Zilong Zhu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.5.0
>
>
> The NameNodeResourceMonitor automatically enters safemode when it detects 
> that resources are not sufficient. The NNRM runs only on the ANN. If both 
> the ANN and the SNN enter safemode due to low resources, and later the SNN's 
> disk space is restored, the SNN will become ANN and the ANN will become SNN. 
> However, at this point the new SNN will not exit safemode, even after its 
> disk recovers.
> Consider the following scenario:
>  * Initially, nn-1 is active and nn-2 is standby. Resources are insufficient 
> in dfs.namenode.name.dir on both nn-1 and nn-2; the NameNodeResourceMonitor 
> detects the resource issue and puts nn-1 into safemode.
>  * At this point, nn-1 is in safemode (ON) and active, while nn-2 is in 
> safemode (OFF) and standby.
>  * After a period of time, the resources in nn-2's dfs.namenode.name.dir 
> recover, triggering failover.
>  * Now, nn-1 is in safemode (ON) and standby, while nn-2 is in safemode 
> (OFF) and active.
>  * Afterward, the resources in nn-1's dfs.namenode.name.dir recover.
>  * However, since nn-1 is standby but in safemode (ON), it is unable to exit 
> safemode automatically.
> There are two possible ways to fix this issue:
>  # If the SNN is detected to be in safemode because of low resources, make 
> it exit.
>  # Or, since we already have HDFS-17231, we can revert HDFS-2914, bringing 
> the NNRM back to the SNN.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-17380) FsImageValidation: remove inaccessible nodes

2024-03-17 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He resolved HDFS-17380.

Fix Version/s: 3.5.0
 Hadoop Flags: Reviewed
   Resolution: Fixed

> FsImageValidation: remove inaccessible nodes
> 
>
> Key: HDFS-17380
> URL: https://issues.apache.org/jira/browse/HDFS-17380
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: tools
>Reporter: Tsz-wo Sze
>Assignee: Tsz-wo Sze
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.5.0
>
>
> If an fsimage is corrupted, it may have inaccessible nodes.  The 
> FsImageValidation tool is currently able to identify the inaccessible nodes 
> when validating the INodeMap.  This JIRA is to update the tool to remove the 
> inaccessible nodes and then save a new fsimage.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-17391) Adjust the checkpoint io buffer size to the chunk size

2024-03-12 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He resolved HDFS-17391.

Fix Version/s: 3.5.0
 Hadoop Flags: Reviewed
   Resolution: Fixed

> Adjust the checkpoint io buffer size to the chunk size
> --
>
> Key: HDFS-17391
> URL: https://issues.apache.org/jira/browse/HDFS-17391
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: lei w
>Assignee: lei w
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.5.0
>
>
> Adjust the checkpoint io buffer size to the chunk size to reduce checkpoint 
> time.
> Before change:
> 2022-07-11 07:10:50,900 INFO 
> org.apache.hadoop.hdfs.server.namenode.TransferFsImage: Uploaded image with 
> txid 374700896827 to namenode at http://:50070 in 1729.465 seconds
> After change:
> 2022-07-12 08:15:55,068 INFO 
> org.apache.hadoop.hdfs.server.namenode.TransferFsImage: Uploaded image with 
> txid 375717629244 to namenode at http://:50070  in 858.668 seconds
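> The change amounts to copying the checkpoint stream with a buffer sized to 
> the transfer chunk rather than a small default; a hedged sketch (names are 
> illustrative, not the TransferFsImage code):
> {code:java}
> // Fewer, larger writes reduce per-call overhead on a multi-GB fsimage
> // upload; buf is sized to the chunk instead of, e.g., a few KB.
> static void copy(InputStream in, OutputStream out, int chunkSize)
>     throws IOException {
>   byte[] buf = new byte[chunkSize];
>   int n;
>   while ((n = in.read(buf)) > 0) {
>     out.write(buf, 0, n);
>   }
> }
> {code}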



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-17422) Enhance the stability of the unit test TestDFSAdmin

2024-03-11 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He resolved HDFS-17422.

Fix Version/s: 3.5.0
 Hadoop Flags: Reviewed
   Resolution: Fixed

>  Enhance the stability of the unit test TestDFSAdmin
> 
>
> Key: HDFS-17422
> URL: https://issues.apache.org/jira/browse/HDFS-17422
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: lei w
>Assignee: lei w
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.5.0
>
>
> It has been observed that TestDFSAdmin frequently fails tests, such as 
> [PR-6620|https://github.com/apache/hadoop/pull/6620]. The failure occurs when 
> the test method testDecommissionDataNodesReconfig asserts the first line of 
> the standard output. The issue arises when the content being checked does not 
> appear on a single line. I believe we should change the method of testing. 
> The standard output content, which was printed in 
> [PR-6620|https://github.com/apache/hadoop/pull/6620], is as follows :
> {panel:title=TestInformation}
> 2024-03-11 02:36:19,442 [main] INFO  tools.TestDFSAdmin 
> (TestDFSAdmin.java:testDecommissionDataNodesReconfig(1356)) - 
> outsForFinishReconf first element is Reconfiguring status for node 
> [127.0.0.1:41361]: started at Mon Mar 11 02:36:18 UTC 2024 and finished at 
> Mon Mar 11 02:36:18 UTC 2024., all element is [Reconfiguring status for node 
> [127.0.0.1:41361]: started at Mon Mar 11 02:36:18 UTC 2024 and finished at 
> Mon Mar 11 02:36:18 UTC 2024., SUCCESS: Changed property 
> dfs.datanode.data.transfer.bandwidthPerSec,  From: "0",  To: "1000", 
> Reconfiguring status for node [127.0.0.1:33073]: started at Mon Mar 11 
> 02:36:18 UTC 2024 and finished at Mon Mar 11 02:36:18 UTC 2024., SUCCESS: 
> Changed property dfs.datanode.data.transfer.bandwidthPerSec,   From: "0", 
>  To: "1000", Retrieval of reconfiguration status successful on 2 nodes, 
> failed on 0 nodes.], node1Addr is 127.0.0.1:41361 , node2Addr is 
> 127.0.0.1:33073.
> {panel}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-17182) DataSetLockManager.lockLeakCheck() is not thread-safe.

2024-01-03 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He resolved HDFS-17182.

Fix Version/s: 3.4.0
 Hadoop Flags: Reviewed
   Resolution: Fixed

> DataSetLockManager.lockLeakCheck() is not thread-safe. 
> ---
>
> Key: HDFS-17182
> URL: https://issues.apache.org/jira/browse/HDFS-17182
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: liuguanghua
>Assignee: liuguanghua
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>
> threadCountMap is not thread-safe. All the other functions that access it 
> are protected by synchronized, except lockLeakCheck(). Add synchronized to 
> the lockLeakCheck() function as well.
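> A hedged, simplified sketch of the fix (not the actual DataSetLockManager 
> fields or method bodies):
> {code:java}
> class LockManagerSketch {
>   private final java.util.Map<Long, Integer> threadCountMap =
>       new java.util.HashMap<>();
>
>   synchronized void countUp(long tid) {
>     threadCountMap.merge(tid, 1, Integer::sum);
>   }
>
>   // The leak check iterates the map, so it must hold the same monitor as
>   // the mutators above; without synchronized it can race with them and
>   // see an inconsistent map (or throw ConcurrentModificationException).
>   synchronized void lockLeakCheck() {
>     threadCountMap.forEach((tid, n) -> {
>       if (n > 0) {
>         System.out.println("possible lock leak held by thread " + tid);
>       }
>     });
>   }
> }
> {code}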



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-17262) Transfer rate metric warning log is too verbose

2023-12-06 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He resolved HDFS-17262.

Fix Version/s: 3.4.0
 Hadoop Flags: Reviewed
   Resolution: Fixed

> Transfer rate metric warning log is too verbose
> ---
>
> Key: HDFS-17262
> URL: https://issues.apache.org/jira/browse/HDFS-17262
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Bryan Beaudreault
>Assignee: Xing Lin
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>
> HDFS-16917 added a LOG.warn when the passed duration is 0. The unit for 
> duration is millis, and it's very possible for a read to take less than a 
> millisecond over a local TCP connection. We are seeing this spam multiple 
> times per millisecond. There's another report on the PR for HDFS-16917.
> Please downgrade the log to debug or remove it.
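> A hedged sketch of the requested change (method and metrics-hook names are 
> illustrative):
> {code:java}
> static void updateTransferRate(long bytes, long durationMs) {
>   if (durationMs <= 0) {
>     // Sub-millisecond reads are normal on local TCP connections; skip the
>     // sample (or log at debug) instead of warning on every fast read.
>     return;
>   }
>   long bytesPerSecond = bytes * 1000 / durationMs;
>   // metrics.addTransferRate(bytesPerSecond); // hypothetical metrics hook
> }
> {code}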



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-17270) RBF: Fix ZKDelegationTokenSecretManagerImpl using a closed zookeeper client to get token in some cases

2023-12-06 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He resolved HDFS-17270.

Fix Version/s: 3.4.0
 Hadoop Flags: Reviewed
   Resolution: Fixed

> RBF: Fix ZKDelegationTokenSecretManagerImpl using a closed zookeeper client 
> to get token in some cases
> --
>
> Key: HDFS-17270
> URL: https://issues.apache.org/jira/browse/HDFS-17270
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: lei w
>Assignee: lei w
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
> Attachments: CuratorFrameworkException
>
>
> We use CuratorFramework to simplify using ZooKeeper in 
> ZKDelegationTokenSecretManagerImpl, and we always hold the same ZooKeeper 
> client after ZKDelegationTokenSecretManagerImpl is initialized. But in some 
> cases, such as a network problem, CuratorFramework may close the current 
> ZooKeeper client and create a new one. In that case we end up using a client 
> that has already been closed to get tokens. We encountered this situation in 
> our cluster; the exception information is in the attachment.
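> A hedged sketch of the hazard (simplified, not the 
> ZKDelegationTokenSecretManagerImpl code): caching the raw ZooKeeper handle 
> defeats Curator's reconnection handling.
> {code:java}
> // Unsafe: the cached handle can be closed and replaced by Curator after
> // a session loss, so later reads go through a dead client.
> ZooKeeper cached = curator.getZookeeperClient().getZooKeeper();
>
> // Safer: issue each operation through the CuratorFramework facade, which
> // resolves the current client and applies its retry policy every time.
> byte[] tokenData = curator.getData().forPath(tokenPath);
> {code}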



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-17250) EditLogTailer#triggerActiveLogRoll should handle thread Interrupted

2023-12-04 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He resolved HDFS-17250.

Fix Version/s: 3.4.0
 Hadoop Flags: Reviewed
   Resolution: Fixed

> EditLogTailer#triggerActiveLogRoll should handle thread Interrupted
> ---
>
> Key: HDFS-17250
> URL: https://issues.apache.org/jira/browse/HDFS-17250
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Haiyang Hu
>Assignee: Haiyang Hu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>
> *Issue:*
> When the NameNode attempts to trigger a log roll and the cachedActiveProxy 
> points to a NameNode whose machine has been shut down, it is unable to 
> establish a network connection. This results in a timeout during the socket 
> connection phase, which has a set timeout of 90 seconds. Since the 
> asynchronous call for "Triggering log roll" waits at most 60 seconds, it 
> times out and initiates a "cancel" operation, causing the executing thread 
> to receive an "Interrupted" signal and throw a 
> "java.io.InterruptedIOException".
> Currently the logic does not handle the interrupted signal, and since the 
> "getActiveNodeProxy" method hasn't reached its maximum retry limit, the 
> overall execution does not exit; it continues to attempt to call 
> "rollEditLog" on the next NameNode in the list. However, when a socket 
> connection is established, it throws a 
> "java.nio.channels.ClosedByInterruptException" because the thread is still 
> in the "Interrupted" state.
> This cycle repeats until it reaches the maximum retry limit (nnCount * 
> maxRetries) and exits.
> In the next cycle of "Triggering log roll" it traverses the NameNode list 
> again and encounters the same issue, since the cachedActiveProxy is still 
> the shut-down NameNode.
> This eventually results in the NameNode being unable to successfully 
> complete the "Triggering log roll" operation.
> To fix this, we need to handle the thread being interrupted and exit the 
> execution.
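> The proposed handling, as a hedged sketch of the retry loop (structure is 
> illustrative, not the committed getActiveNodeProxy code):
> {code:java}
> while (retries < nnCount * maxRetries) {
>   try {
>     proxy.rollEditLog();
>     return;
>   } catch (IOException e) {
>     if (Thread.currentThread().isInterrupted()
>         || e instanceof InterruptedIOException
>         || e instanceof ClosedByInterruptException) {
>       // The async caller cancelled us; trying the next NameNode would
>       // only throw ClosedByInterruptException again, so exit now.
>       throw e;
>     }
>     retries++;  // otherwise move on to the next NameNode in the list
>   }
> }
> {code}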
> *Detailed logs such as:*
> The Observer node "ob1" executes "Triggering log roll" as follows:
> the nns list is [ob2 (shut-down machine), nn1 (active), nn2 (standby)]
> * The Observer node "ob1" periodically executes "triggerActiveLogRoll" and 
> asynchronously calls "getNameNodeProxy" to request the "rollEditLog" 
> operation from "ob2". Since the "ob2" machine is shut down, it cannot 
> establish a network connection; this results in a timeout during the socket 
> connection phase (here the set timeout is 90 seconds).
> {quote}
> 2023-11-03 10:27:41,734 INFO  ha.EditLogTailer 
> (EditLogTailer.java:triggerActiveLogRoll(465)) [Edit log tailer] - Triggering 
> log roll on remote NameNode
> 2023-11-03 10:28:41,734 WARN  ha.EditLogTailer 
> (EditLogTailer.java:triggerActiveLogRoll(478)) [Edit log tailer] - Unable to 
> finish rolling edits in 6 ms
> {quote}
> * As the asynchronous call for "Triggering log roll" has a waiting time of 60 
> seconds, it triggers a timeout and initiates a "cancel" operation, causing 
> the executing thread to receive an "Interrupted" signal and will throw 
> "java.io.InterruptedIOException".
> {quote}
> 2023-11-03 10:28:41,753 WARN  ipc.Client 
> (Client.java:handleConnectionFailure(930)) [pool-33-thread-1] - Interrupted 
> while trying for connection
> 2023-11-03 10:28:41,972 WARN  ha.EditLogTailer (EditLogTailer.java:call(618)) 
> [pool-33-thread-1] - Exception from remote name node RemoteNameNodeInfo 
> [nnId=ob2, ipcAddress=xxx, httpAddress=http://xxx], try next.
> java.io.InterruptedIOException: DestHost:destPort xxx , LocalHost:localPort 
> xxx. Failed on local exception: java.io.InterruptedIOException: Interrupted 
> while waiting for IO on channel 
> java.nio.channels.SocketChannel[connection-pending remote=xxx:8040]. Total 
> timeout mills is 9, 30038 millis timeout left.
> at sun.reflect.GeneratedConstructorAccessor39.newInstance(Unknown 
> Source)
> at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
> at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:931)
> at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:906)
> at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1583)
> at org.apache.hadoop.ipc.Client.call(Client.java:1511)
> at org.apache.hadoop.ipc.Client.call(Client.java:1402)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine2$Invoker.invoke(ProtobufRpcEngine2.java:261)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine2$Invoker.invoke(ProtobufRpcEngine2.java:141)
>  

[jira] [Resolved] (HDFS-17218) NameNode should process time out excess redundancy blocks

2023-12-04 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He resolved HDFS-17218.

Fix Version/s: 3.4.0
 Hadoop Flags: Reviewed
   Resolution: Fixed

> NameNode should process time out excess redundancy blocks
> -
>
> Key: HDFS-17218
> URL: https://issues.apache.org/jira/browse/HDFS-17218
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Reporter: Haiyang Hu
>Assignee: Haiyang Hu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
> Attachments: image-2023-10-12-15-52-52-336.png
>
>
> We found that a DN will lose all pending DNA_INVALIDATE blocks if it 
> restarts.
> *Root cause*
> The current DN enables asynchronous deletion and may have many pending 
> deletion blocks in memory.
> When the DN restarts, these cached blocks may be lost. This causes some 
> blocks in the excess map in the namenode to be leaked, which results in many 
> blocks having more replicas than expected.
> *Solution*
> Add logic to the NameNode to handle excess redundancy block timeouts: if the 
> NN determines that an excess redundancy block on a DN has timed out, it 
> re-adds the block to invalidates.
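> A hedged sketch of that timeout handling (structure and the entry type are 
> illustrative, not the committed BlockManager code):
> {code:java}
> void processTimedOutExcessBlocks(long timeoutMs) {
>   final long now = Time.monotonicNow();
>   for (ExcessEntry e : excessRedundancyMap) {   // hypothetical entry type
>     if (now - e.addedTime > timeoutMs) {
>       // The DN restarted and lost its pending DNA_INVALIDATE, so the
>       // deletion never happened; schedule the invalidation again.
>       addToInvalidates(e.block, e.datanode);
>     }
>   }
> }
> {code}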



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-17063) Support to configure different capacity reserved for each disk of DataNode.

2023-11-16 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He resolved HDFS-17063.

Fix Version/s: 3.4.0
   Resolution: Fixed

> Support to configure different capacity reserved for each disk of DataNode.
> ---
>
> Key: HDFS-17063
> URL: https://issues.apache.org/jira/browse/HDFS-17063
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: datanode, hdfs
>Affects Versions: 3.3.6
>Reporter: Jiale Qi
>Assignee: Jiale Qi
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>
> Now _dfs.datanode.du.reserved_ takes effect for every directory of a 
> datanode.
> This issue allows a cluster administrator to configure 
> _dfs.datanode.du.reserved./data/hdfs1/data_, which takes effect only for 
> that specific directory.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-17172) Support FSNamesystemLock Parameters reconfigurable

2023-11-16 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He resolved HDFS-17172.

Fix Version/s: 3.4.0
 Hadoop Flags: Reviewed
   Resolution: Fixed

> Support FSNamesystemLock Parameters reconfigurable
> --
>
> Key: HDFS-17172
> URL: https://issues.apache.org/jira/browse/HDFS-17172
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Haiyang Hu
>Assignee: Haiyang Hu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>
> For the Namesystem lock, we will support reconfiguring parameters such as 
> "dfs.namenode.lock.detailed-metrics.enabled", 
> "dfs.namenode.read-lock-reporting-threshold-ms" and 
> "dfs.namenode.write-lock-reporting-threshold-ms"
> without a namenode restart, which is convenient for checking the metrics 
> behavior of FSNamesystemLock.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-17231) HA: Safemode should exit when resources are from low to available

2023-10-24 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He resolved HDFS-17231.

Fix Version/s: 3.4.0
 Hadoop Flags: Reviewed
   Resolution: Fixed

> HA: Safemode should exit when resources are from low to available
> -
>
> Key: HDFS-17231
> URL: https://issues.apache.org/jira/browse/HDFS-17231
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: ha
>Affects Versions: 3.3.4, 3.3.6
>Reporter: kuper
>Assignee: kuper
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
> Attachments: 企业微信截图_75d15d37-26b7-4d88-ac0c-8d77e358761b.png
>
>
> The NameNodeResourceMonitor automatically enters safe mode when it detects 
> that the resources are not sufficient. When zkfc detects insufficient 
> resources, it triggers failover. Consider the following scenario:
>  * Initially, nn01 is active and nn02 is standby. Due to insufficient 
> resources in dfs.namenode.name.dir, the NameNodeResourceMonitor detects the 
> resource issue and puts nn01 into safemode. Subsequently, zkfc triggers 
> failover.
>  * At this point, nn01 is in safemode (ON) and standby, while nn02 is in 
> safemode (OFF) and active.
>  * After a period of time, the resources in nn01's dfs.namenode.name.dir 
> recover, causing a slight instability and triggering failover again.
>  * Now, nn01 is in safe mode (ON) and active, while nn02 is in safe mode 
> (OFF) and standby.
>  * However, since nn01 is active but in safemode (ON), hdfs cannot be read 
> from or written to.
> !企业微信截图_75d15d37-26b7-4d88-ac0c-8d77e358761b.png!
> *Reproduction*
>  # Increase dfs.namenode.resource.du.reserved.
>  # Increase ha.health-monitor.check-interval.ms to avoid directly switching 
> to standby and stopping the NameNodeResourceMonitor thread; instead, wait 
> for the NameNodeResourceMonitor to enter safemode before switching to 
> standby.
>  # On the nn01 active node, use the dd command to create a file that exceeds 
> the threshold, triggering a low-available-disk-space condition. 
>  # If the nn01 namenode process is not dead, nn01 ends up in safemode (ON) 
> and standby.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-17217) Add lifeline RPC start up log when NameNode#startCommonServices

2023-10-12 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He resolved HDFS-17217.

Fix Version/s: 3.4.0
 Hadoop Flags: Reviewed
   Resolution: Fixed

Marked Resolution/Fix Version and the Reviewed flag since the PR has been 
committed to trunk.
cc [~zhangshuyan].

> Add lifeline RPC start up  log when NameNode#startCommonServices
> 
>
> Key: HDFS-17217
> URL: https://issues.apache.org/jira/browse/HDFS-17217
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Haiyang Hu
>Assignee: Haiyang Hu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>
> If the lifeline RPC server is started in the NameNode, we need to add a 
> lifeline RPC start-up log in NameNode#startCommonServices.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-17208) Add the metrics PendingAsyncDiskOperations in datanode

2023-10-12 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He resolved HDFS-17208.

Fix Version/s: 3.4.0
 Hadoop Flags: Reviewed
   Resolution: Fixed

> Add the metrics PendingAsyncDiskOperations  in datanode 
> 
>
> Key: HDFS-17208
> URL: https://issues.apache.org/jira/browse/HDFS-17208
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Reporter: Haiyang Hu
>Assignee: Haiyang Hu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>
> We should add the metric `PendingAsyncDiskOperations` to be able to track 
> whether we are queueing too many asynchronous disk operations in the 
> FsDatasetAsyncDiskService of the datanode.
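> A hedged sketch of where such a metric could come from (names illustrative; 
> the assumption here is that the async disk service keeps one executor per 
> volume):
> {code:java}
> // Sum the not-yet-completed tasks across the per-volume executors.
> public long countPendingAsyncDiskOperations() {
>   long pending = 0;
>   for (ThreadPoolExecutor executor : executors.values()) {
>     pending += executor.getTaskCount() - executor.getCompletedTaskCount();
>   }
>   return pending;
> }
> {code}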



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-17204) EC: Reduce unnecessary log when processing excess redundancy.

2023-09-26 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He resolved HDFS-17204.

Fix Version/s: 3.4.0
 Hadoop Flags: Reviewed
   Resolution: Fixed

> EC: Reduce unnecessary log when processing excess redundancy.
> -
>
> Key: HDFS-17204
> URL: https://issues.apache.org/jira/browse/HDFS-17204
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Shuyan Zhang
>Assignee: Shuyan Zhang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>
> This is a follow-up of 
> [HDFS-16964|https://issues.apache.org/jira/browse/HDFS-16964]. We now avoid 
> stale replicas when dealing with redundancy. This may result in redundant 
> replicas not being in the `nonExcess` set when we enter 
> `BlockManager#chooseExcessRedundancyStriped` (because the datanode where the 
> redundant replicas are located has not sent an FBR yet, so those replicas are 
> filtered out and not added to the `nonExcess` set). A further result is that 
> no excess storage type is selected and the log "excess types chosen for 
> block..." is printed. When a failover occurs, a large number of datanodes 
> become stale, which causes NameNodes to print a large number of unnecessary 
> logs.
> This issue needs to be fixed, otherwise the performance after failover will 
> be affected.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-17197) Show file replication when listing corrupt files.

2023-09-24 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He resolved HDFS-17197.

Fix Version/s: 3.4.0
 Hadoop Flags: Incompatible change,Reviewed
   Resolution: Fixed

> Show file replication when listing corrupt files.
> -
>
> Key: HDFS-17197
> URL: https://issues.apache.org/jira/browse/HDFS-17197
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Shuyan Zhang
>Assignee: Shuyan Zhang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>
> Files with different replication have different reliability guarantees. We 
> need to pay attention to corrupted files with a specified replication greater 
> than or equal to 3. So, when listing corrupt files, it would be useful to 
> display the corresponding replication of the files.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-17184) Improve BlockReceiver to throws DiskOutOfSpaceException when initialize

2023-09-21 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He resolved HDFS-17184.

Fix Version/s: 3.4.0
 Hadoop Flags: Reviewed
   Resolution: Fixed

> Improve BlockReceiver to throws DiskOutOfSpaceException when initialize
> ---
>
> Key: HDFS-17184
> URL: https://issues.apache.org/jira/browse/HDFS-17184
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Haiyang Hu
>Assignee: Haiyang Hu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>
> The BlockReceiver class receives a block and writes it to disk.
> In the constructor, createTemporary and createRbw execute chooseVolume, and 
> a DiskOutOfSpaceException may occur in chooseVolume.
> In the current processing logic, if the exception occurs it is caught by the 
> catch (IOException ioe) at BlockReceiver.java line 282, and cleanupBlock() 
> is executed there.
> Since the replica of the current block has not been added to ReplicaMap, 
> executing cleanupBlock() throws ReplicaNotFoundException.
> The ReplicaNotFoundException then overwrites the actual 
> DiskOutOfSpaceException, resulting in inaccurate exception information.
> {code:java}
> BlockReceiver(final ExtendedBlock block, final StorageType storageType,
>   final DataInputStream in,
>   final String inAddr, final String myAddr,
>   final BlockConstructionStage stage, 
>   final long newGs, final long minBytesRcvd, final long maxBytesRcvd, 
>   final String clientname, final DatanodeInfo srcDataNode,
>   final DataNode datanode, DataChecksum requestedChecksum,
>   CachingStrategy cachingStrategy,
>   final boolean allowLazyPersist,
>   final boolean pinning,
>   final String storageId) throws IOException {
> try{
>   ...
>  } catch (ReplicaAlreadyExistsException bae) {
>throw bae;
>  } catch (ReplicaNotFoundException bne) {
>throw bne;
>  } catch(IOException ioe) {
>   if (replicaInfo != null) {
> replicaInfo.releaseAllBytesReserved();
>   }
>   IOUtils.closeStream(this); 
>   cleanupBlock();  // throws ReplicaNotFoundException if the replica is 
> not in ReplicaMap
>   ...
>   throw ioe;
> }
>   }
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-17105) mistakenly purge editLogs even after it is empty in NNStorageRetentionManager

2023-09-19 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He resolved HDFS-17105.

Fix Version/s: 3.4.0
 Hadoop Flags: Reviewed
 Assignee: ConfX
   Resolution: Fixed

>  mistakenly purge editLogs even after it is empty in NNStorageRetentionManager
> --
>
> Key: HDFS-17105
> URL: https://issues.apache.org/jira/browse/HDFS-17105
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: ConfX
>Assignee: ConfX
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 3.4.0
>
> Attachments: reproduce.sh
>
>
> h2. What happened:
> Got {{IndexOutOfBoundsException}} after setting 
> {{dfs.namenode.max.extra.edits.segments.retained}} to a negative value and 
> purging old records with {{NNStorageRetentionManager}}.
> h2. Where's the bug:
> In line 156 of {{NNStorageRetentionManager}}, the manager trims 
> {{editLogs}} until it is under {{maxExtraEditsSegmentsToRetain}}:
> {noformat}
> while (editLogs.size() > maxExtraEditsSegmentsToRetain) {
>       purgeLogsFrom = editLogs.get(0).getLastTxId() + 1;
>       editLogs.remove(0);
> }{noformat}
> However, if {{dfs.namenode.max.extra.edits.segments.retained}} is set below 
> 0, the size of {{editLogs}} can never drop below it, so the loop keeps 
> removing entries until {{editLogs.size()==0}} and {{editLogs.get(0)}} is 
> out of range.
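> A defensive fix, sketched below (the actual patch may instead validate the 
> configuration value):
> {code:java}
> // Clamp the retention count so a negative configuration value can never
> // drain the list past empty; get(0) then only runs while the list is
> // non-empty.
> int toRetain = Math.max(0, maxExtraEditsSegmentsToRetain);
> while (editLogs.size() > toRetain) {
>   purgeLogsFrom = editLogs.get(0).getLastTxId() + 1;
>   editLogs.remove(0);
> }
> {code}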
> h2. How to reproduce:
> (1) Set {{dfs.namenode.max.extra.edits.segments.retained}} to -1974676133
> (2) Run test: 
> {{org.apache.hadoop.hdfs.server.namenode.TestNNStorageRetentionManager#testNoLogs}}
> h2. Stacktrace:
> {noformat}
> java.lang.IndexOutOfBoundsException: Index 0 out of bounds for length 0
>     at 
> java.base/jdk.internal.util.Preconditions.outOfBounds(Preconditions.java:64)
>     at 
> java.base/jdk.internal.util.Preconditions.outOfBoundsCheckIndex(Preconditions.java:70)
>     at 
> java.base/jdk.internal.util.Preconditions.checkIndex(Preconditions.java:248)
>     at java.base/java.util.Objects.checkIndex(Objects.java:372)
>     at java.base/java.util.ArrayList.get(ArrayList.java:459)
>     at 
> org.apache.hadoop.hdfs.server.namenode.NNStorageRetentionManager.purgeOldStorage(NNStorageRetentionManager.java:157)
>     at 
> org.apache.hadoop.hdfs.server.namenode.TestNNStorageRetentionManager.runTest(TestNNStorageRetentionManager.java:299)
>     at 
> org.apache.hadoop.hdfs.server.namenode.TestNNStorageRetentionManager.testNoLogs(TestNNStorageRetentionManager.java:143){noformat}
> For an easy reproduction, run the reproduce.sh in the attachment.
> We are happy to provide a patch if this issue is confirmed.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-17192) Add block info when constructing remote block reader meets IOException

2023-09-18 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He resolved HDFS-17192.

Fix Version/s: 3.4.0
 Hadoop Flags: Reviewed
   Resolution: Fixed

> Add block info when constructing remote block reader meets IOException
> -
>
> Key: HDFS-17192
> URL: https://issues.apache.org/jira/browse/HDFS-17192
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs-client
>Affects Versions: 3.4.0
>Reporter: farmmamba
>Assignee: farmmamba
>Priority: Trivial
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>
> Currently, when constructing a remote block reader hits an IOException, the 
> block info is not logged. We should add it to help troubleshoot problems.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-17190) EC: Fix bug of OIV processing XAttr.

2023-09-17 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He resolved HDFS-17190.

Fix Version/s: 3.4.0
 Hadoop Flags: Reviewed
 Assignee: Shuyan Zhang
   Resolution: Fixed

> EC: Fix bug of OIV processing XAttr.
> 
>
> Key: HDFS-17190
> URL: https://issues.apache.org/jira/browse/HDFS-17190
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Shuyan Zhang
>Assignee: Shuyan Zhang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>
> When we need to use OIV to print EC information for a directory, 
> `PBImageTextWriter#getErasureCodingPolicyName` will be called. Currently, 
> this method uses `XATTR_ERASURECODING_POLICY.contains(xattr.getName())` to 
> filter for the EC XAttr, which is very dangerous. If we have an XAttr 
> whose name happens to be a substring of `hdfs.erasurecoding.policy`, then 
> `getErasureCodingPolicyName` will return the wrong result. Our internal 
> production environment has customized some XAttrs, and this bug caused 
> errors in the OIV parsing results when the `-ec` option is used. 
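> A minimal illustration of the difference, assuming the constant equals 
> "hdfs.erasurecoding.policy" as stated above:
> {code:java}
> // A customized XAttr whose name happens to be a substring of the EC
> // policy name passes the contains() check but fails exact equality.
> String XATTR_ERASURECODING_POLICY = "hdfs.erasurecoding.policy";
> String customXAttrName = "coding.policy";
> boolean buggyMatch = XATTR_ERASURECODING_POLICY.contains(customXAttrName); // true
> boolean exactMatch = XATTR_ERASURECODING_POLICY.equals(customXAttrName);   // false
> {code}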



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-17140) Revisit the BPOfferService.reportBadBlocks() method.

2023-09-06 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He resolved HDFS-17140.

Fix Version/s: 3.4.0
 Hadoop Flags: Reviewed
   Resolution: Fixed

> Revisit the BPOfferService.reportBadBlocks() method.
> 
>
> Key: HDFS-17140
> URL: https://issues.apache.org/jira/browse/HDFS-17140
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Reporter: Liangjun He
>Assignee: Liangjun He
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>
> The current BPOfferService.reportBadBlocks() method can be optimized by 
> moving the creation of the rbbAction object outside the loop.
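> A minimal sketch of the optimization (close to, but not necessarily 
> identical to, the committed change): the action is the same for every 
> actor, so build it once.
> {code:java}
> // Hoisted out of the loop: one ReportBadBlockAction shared by all actors
> // instead of a fresh, identical object per BPServiceActor.
> ReportBadBlockAction rbbAction =
>     new ReportBadBlockAction(block, storageUuid, storageType);
> for (BPServiceActor actor : bpServices) {
>   actor.bpThreadEnqueue(rbbAction);
> }{code}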



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-16933) A race in SerialNumberMap will cause wrong owner, group and XATTR

2023-09-05 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He resolved HDFS-16933.

Fix Version/s: 3.4.0
 Hadoop Flags: Reviewed
   Resolution: Fixed

> A race in SerialNumberMap will cause wrong owner, group and XATTR
> -
>
> Key: HDFS-16933
> URL: https://issues.apache.org/jira/browse/HDFS-16933
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: ZanderXu
>Assignee: ZanderXu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>
> If the namenode enables parallel fsimage loading, a race that occurs in 
> SerialNumberMap will cause wrong ownership for INodes.
> {code:java}
> public int get(T t) {
>   if (t == null) {
>     return 0;
>   }
>   Integer sn = t2i.get(t);
>   if (sn == null) {
>     // Assume there are two threads with different t, such as:
>     //   T1 with hbase
>     //   T2 with hdfs
>     // If T1 and T2 get the sn at the same time, they will get the same
>     // sn, such as 10.
>     sn = current.getAndIncrement();
>     if (sn > max) {
>       current.getAndDecrement();
>       throw new IllegalStateException(name + ": serial number map is full");
>     }
>     Integer old = t2i.putIfAbsent(t, sn);
>     if (old != null) {
>       current.getAndDecrement();
>       return old;
>     }
>     // If T1 puts 10 -> hbase into i2t first, T2 will overwrite it with
>     // 10 -> hdfs. The INodes will then get a wrong owner hdfs when it
>     // should actually be hbase.
>     i2t.put(sn, t);
>   }
>   return sn;
> } {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-17093) Fix block report lease issue to avoid missing some storages report.

2023-08-28 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He resolved HDFS-17093.

Fix Version/s: 3.4.0
 Hadoop Flags: Reviewed
   Resolution: Fixed

> Fix block report lease issue to avoid missing some storages report.
> ---
>
> Key: HDFS-17093
> URL: https://issues.apache.org/jira/browse/HDFS-17093
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 3.3.4
>Reporter: Yanlei Yu
>Assignee: Yanlei Yu
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>
> In our cluster of 800+ nodes, after restarting the namenode, we found that 
> some datanodes did not report enough blocks, causing the namenode to stay in 
> safe mode for a long time after restarting because of incomplete block 
> reports.
> I found in the logs of the datanode with incomplete block reporting that the 
> first FBR attempt failed, possibly due to namenode stress, and then a second 
> FBR attempt was made as follows:
> {code:java}
> 
> 2023-07-17 11:29:28,982 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
> Unsuccessfully sent block report 0x6237a52c1e817e,  containing 12 storage 
> report(s), of which we sent 1. The reports had 1099057 total blocks and used 
> 1 RPC(s). This took 294 msec to generate and 101721 msecs for RPC and NN 
> processing. Got back no commands.
> 2023-07-17 11:37:04,014 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
> Successfully sent block report 0x62382416f3f055,  containing 12 storage 
> report(s), of which we sent 12. The reports had 1099048 total blocks and used 
> 12 RPC(s). This took 295 msec to generate and 11647 msecs for RPC and NN 
> processing. Got back no commands. {code}
> There's nothing wrong with that: retry the send if it fails. But consider 
> the logic on the namenode side:
> {code:java}
> if (namesystem.isInStartupSafeMode()
>     && !StorageType.PROVIDED.equals(storageInfo.getStorageType())
>     && storageInfo.getBlockReportCount() > 0) {
>   blockLog.info("BLOCK* processReport 0x{} with lease ID 0x{}: "
>       + "discarded non-initial block report from {}"
>       + " because namenode still in startup phase",
>       strBlockReportId, fullBrLeaseId, nodeID);
>   blockReportLeaseManager.removeLease(node);
>   return !node.hasStaleStorages();
> } {code}
> When a storage is identified as having reported before, namely 
> storageInfo.getBlockReportCount() > 0, the lease is removed from the 
> datanode, so the second report fails because it no longer holds a lease.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-17151) EC: Fix wrong metadata in BlockInfoStriped after recovery

2023-08-20 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He resolved HDFS-17151.

Fix Version/s: 3.4.0
 Hadoop Flags: Reviewed
   Resolution: Fixed

> EC: Fix wrong metadata in BlockInfoStriped after recovery
> -
>
> Key: HDFS-17151
> URL: https://issues.apache.org/jira/browse/HDFS-17151
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Shuyan Zhang
>Assignee: Shuyan Zhang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>
> When the datanode completes a block recovery, it will call 
> `commitBlockSynchronization` method to notify NN the new locations of the 
> block. For an EC block group, NN determines the index of each internal block 
> based on the position of the DatanodeID in the parameter `newtargets`.
> If the internal blocks written by the client don't have continuous indices, 
> the current datanode code might cause NN to record incorrect block metadata. 
> For simplicity, let's take RS (3,2) as an example. The timeline of the 
> problem is as follows:
> 1. The client plans to write internal blocks with indices [0,1,2,3,4] to 
> datanode [dn0, dn1, dn2, dn3, dn4] respectively. But dn1 is unable to 
> connect, so the client only writes data to the remaining 4 datanodes;
> 2. Client crashes;
> 3. NN fails over;
> 4. Now the content of `uc.getExpectedStorageLocations()` completely depends 
> on block reports, and it is [dn0, null, dn2, dn3, dn4];
> 5. When the lease exceeds the hard limit, NN issues a block recovery command;
> 6. Datanode that receives the recovery command fills `DatanodeID [] newLocs` 
> with [dn0, null, dn2, dn3, dn4];
> 7. The serialization process filters out null values, so the parameters 
> passed to NN become [dn0, dn2, dn3, dn4];
> 8. NN mistakenly believes that dn2 stores an internal block with index 1, dn3 
> stores an internal block with index 2, and so on.
> The above timeline is just an example, and there are other situations that 
> may result in the same error, such as an update pipeline occurs on the client 
> side. We should fix this bug.
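> A minimal illustration of the failure mode (standalone code, not the actual 
> serialization path): dropping nulls from a positional array silently shifts 
> every later index.
> {code:java}
> // The position in newLocs encodes the EC internal block index.
> String[] newLocs = {"dn0", null, "dn2", "dn3", "dn4"};
> List<String> sent = new ArrayList<>();
> for (String dn : newLocs) {
>   if (dn != null) {  // what the null-filtering serialization effectively does
>     sent.add(dn);
>   }
> }
> // sent == [dn0, dn2, dn3, dn4]: NN now believes dn2 holds index 1.
> {code}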



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-17154) EC: Fix bug in updateBlockForPipeline after failover

2023-08-16 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He resolved HDFS-17154.

Fix Version/s: 3.4.0
 Hadoop Flags: Reviewed
   Resolution: Fixed

> EC: Fix bug in updateBlockForPipeline after failover
> 
>
> Key: HDFS-17154
> URL: https://issues.apache.org/jira/browse/HDFS-17154
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Shuyan Zhang
>Assignee: Shuyan Zhang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>
> In the method `updateBlockForPipeline`, NameNode uses the 
> `BlockUnderConstructionFeature` of a BlockInfo to generate the member 
> `blockIndices` of `LocatedStripedBlock`. 
> And then, NameNode uses `blockIndices` to generate block tokens for client.
> However, if there is a failover, the location info in 
> BlockUnderConstructionFeature may be incomplete, which results in the absence 
> of the corresponding block tokens.
> When the client receives these incomplete block tokens, it will throw a NPE 
> because `updatedBlks[i]` is null.
> NameNode should just return block tokens for all indices to the client. 
> Client can pick whichever it likes to use. 
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-17150) EC: Fix the bug of failed lease recovery.

2023-08-15 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He resolved HDFS-17150.

Fix Version/s: 3.4.0
 Hadoop Flags: Reviewed
   Resolution: Fixed

> EC: Fix the bug of failed lease recovery.
> -
>
> Key: HDFS-17150
> URL: https://issues.apache.org/jira/browse/HDFS-17150
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Shuyan Zhang
>Assignee: Shuyan Zhang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>
> If the client crashes without writing the minimum number of internal blocks 
> required by the EC policy, the lease recovery process for the corresponding 
> unclosed file may continue to fail. Taking RS(6,3) policy as an example, the 
> timeline is as follows:
> 1. The client writes some data to only 5 datanodes;
> 2. Client crashes;
> 3. NN fails over;
> 4. Now the result of `uc.getNumExpectedLocations()` completely depends on 
> block report, and there are 5 datanodes reporting internal blocks;
> 5. When the lease exceeds the hard limit, NN issues a block recovery command;
> 6. The datanode checks the command and finds that the number of internal 
> blocks is insufficient, resulting in an error and recovery failure;
> 7. The lease exceeds the hard limit again, and NN issues another block 
> recovery command, but the recovery fails again...
> When the number of internal blocks written by the client is less than 6, the 
> block group is actually unrecoverable. We should equate this situation to the 
> case where the number of replicas is 0 when processing replica files, i.e., 
> directly remove the last block group and close the file.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-17137) Standby/Observer NameNode skip to handle redundant replica block logic when set decrease replication.

2023-08-08 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He resolved HDFS-17137.

Fix Version/s: 3.4.0
 Hadoop Flags: Reviewed
   Resolution: Fixed

> Standby/Observer NameNode skip to handle redundant replica block logic when 
> set decrease replication. 
> --
>
> Key: HDFS-17137
> URL: https://issues.apache.org/jira/browse/HDFS-17137
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Haiyang Hu
>Assignee: Haiyang Hu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>
> Standby/Observer NameNode should not run the redundant-replica handling 
> logic when a replication decrease is set.
> At present, when setReplication is called to decrease replication:
> * The ActiveNameNode calls BlockManager#processExtraRedundancyBlock to 
> select the dn holding the redundant replica, adds it to the 
> excessRedundancyMap and to invalidateBlocks (the RedundancyMonitor is then 
> scheduled to delete the block on that dn).
> * The StandbyNameNode or ObserverNameNode then loads the editlog and applies 
> the SetReplicationOp. If the dn holding the replica to be deleted has not 
> yet sent an incremental block report, 
> BlockManager#processExtraRedundancyBlock is called here as well to select a 
> redundant-replica dn and add it to the excessRedundancyMap (the dn selected 
> here may be inconsistent with the dn selected on the active namenode).
> A dn left in the excessRedundancyMap can affect decommissioning, making it 
> impossible to complete the decommission of that dn on the 
> Standby/ObserverNameNode.
> A concrete case:
> For example, a file has 3 replicas (d1, d2, d3) and setReplication is called 
> to reduce the file to 2 replicas.
> * The ActiveNameNode selects d1 as the redundant replica and adds it to the 
> excessRedundancyMap and invalidateBlocks.
> * The StandbyNameNode replays the SetReplicationOp (at this time, d1 has not 
> yet sent its incremental block report), so the redundant-replica dn it 
> selects may differ from the ActiveNameNode's choice, e.g. it adds d2 to its 
> excessRedundancyMap.
> * d1 then deletes the block and sends an incremental block report.
> * The DN list for this block on the ActiveNameNode becomes d2 and d3 (d1 is 
> removed from the excessRedundancyMap while processing the incremental block 
> report).
> * The DN list for this block on the StandbyNameNode is also d2 and d3, but 
> d2 cannot be removed from the excessRedundancyMap while processing the 
> incremental block report.
> Now execute the decommission operation on d3.
> * The ActiveNameNode selects a new node d4 to copy the replica, and d4 sends 
> an incremental block report.
> * The DN list for this block on the ActiveNameNode becomes d2, d3 
> (decommissioning status), d4, so d3 can be decommissioned normally.
> * The DN list for this block on the StandbyNameNode is d3 (decommissioning 
> status), d2 (redundant status), d4. Since the requirement of two live 
> replicas is not met, d3 cannot be decommissioned at this time.
> Therefore, the Standby/ObserverNameNode should not process the 
> redundant-replica logic when setReplication is applied.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-17134) RBF: Fix duplicate results of getListing through Router.

2023-08-01 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He resolved HDFS-17134.

Fix Version/s: 3.4.0
 Hadoop Flags: Reviewed
   Resolution: Fixed

> RBF: Fix duplicate results of getListing through Router.
> 
>
> Key: HDFS-17134
> URL: https://issues.apache.org/jira/browse/HDFS-17134
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Shuyan Zhang
>Assignee: Shuyan Zhang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>
> The result of `getListing` in NameNode are sorted based on `byte[]`, while 
> the Router side is based on `String`. If there are special characters in 
> path, the sorting result of the router will be inconsistent with the 
> namenode. This may result in duplicate `getListing` results obtained by the 
> client due to wrong `startAfter` parameter.
> For example, the namenode returns [path1, path2, path3] for a `getListing` 
> request, while the router returns [path1, path3, path2] to the client. The 
> client then passes `path2` as `startAfter` in the next iteration, so it 
> receives `path3` again.
> We need to fix the Router code so that the order of its result is the same as 
> NameNode.
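> A minimal illustration of how the two orderings can disagree (JDK methods 
> used as a stand-in for the HDFS byte comparator; Arrays.compareUnsigned 
> requires Java 9+):
> {code:java}
> // U+FF01 is one UTF-16 code unit (0xFF01); U+10000 is a surrogate pair
> // starting at 0xD800. As Strings, U+10000 sorts first; as UTF-8 bytes
> // (EF BC 81 vs F0 90 80 80), U+FF01 sorts first.
> String s1 = "\uFF01";
> String s2 = new String(Character.toChars(0x10000));
> boolean stringOrder = s1.compareTo(s2) > 0;    // true: s2 before s1
> boolean byteOrder = Arrays.compareUnsigned(
>     s1.getBytes(StandardCharsets.UTF_8),
>     s2.getBytes(StandardCharsets.UTF_8)) < 0;  // true: s1 before s2
> {code}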



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-17116) RBF: Update invoke millisecond time as monotonicNow() in RouterSafemodeService.

2023-07-28 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He resolved HDFS-17116.

Fix Version/s: 3.4.0
 Hadoop Flags: Reviewed
   Resolution: Fixed

> RBF: Update invoke millisecond time as monotonicNow() in 
> RouterSafemodeService.
> ---
>
> Key: HDFS-17116
> URL: https://issues.apache.org/jira/browse/HDFS-17116
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Haiyang Hu
>Assignee: Haiyang Hu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>
> The following exceptions occurred in our online environment:
> # After the machine restarts, the system time is abnormal: it is a time in 
> the future.
> # After starting the router, the log shows "Delaying safemode exit for 
> 24981702 milliseconds...", and the router stays in the safemode state.
> This is mainly because startupTime is recorded against the future system 
> time when the router starts; the system time soon returns to normal, 
> resulting in a negative delta. At this point, the service can only be 
> restored by restarting the router service.
> The relevant logs are:
> {code:java}
> 2023-07-15 03:15:49,276 INFO  ipc.Server xxx
> 2023-07-15 11:21:03,785 INFO  router.DFSRouter (LogAdapter.java:info(51)) 
> [main] - STARTUP_MSG:
> /
> STARTUP_MSG: Starting Router
> ...
> 2023-07-15 11:21:51,325 INFO xxx
> 2023-07-15 03:22:00,257 INFO xxx
> 2023-07-15 03:22:29,829 INFO router.RouterSafemodeService 
> (RouterSafemodeService.java:periodicInvoke(167)) [RouterSafemodeService-0] - 
> Delaying safemode exit for 28761777 milliseconds...
> {code}
> Maybe we can handle this case at the code level and reset startupTime and 
> enterSafeModeTime when a negative delta is detected, which ensures the 
> router service can still exit the safemode state normally after the system 
> time returns to normal.
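> A minimal sketch of the intended change (illustrative): measure the delay 
> with org.apache.hadoop.util.Time.monotonicNow(), which is immune to 
> wall-clock jumps.
> {code:java}
> // Hypothetical: record the entry time on a monotonic clock, so a later
> // system-time correction can never produce a negative delta.
> long enterSafeModeTime = Time.monotonicNow();
> // ... later, in periodicInvoke() ...
> long delta = Time.monotonicNow() - enterSafeModeTime;  // always >= 0
> {code}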



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-17117) Print reconstructionQueuesInitProgress periodically when BlockManager processMisReplicatesAsync.

2023-07-26 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He resolved HDFS-17117.

Fix Version/s: 3.4.0
 Hadoop Flags: Reviewed
   Resolution: Fixed

> Print reconstructionQueuesInitProgress periodically when BlockManager 
> processMisReplicatesAsync.
> 
>
> Key: HDFS-17117
> URL: https://issues.apache.org/jira/browse/HDFS-17117
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Haiyang Hu
>Assignee: Haiyang Hu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>
> BlockManager#processMisReplicatesAsync can periodically print 
> reconstructionQueuesInitProgress, so that the admin can track the progress 
> of the replication queue initialisation.
> The plan is to add the log and corresponding metrics.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-17112) Show decommission duration in JMX and HTML

2023-07-24 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He resolved HDFS-17112.

Fix Version/s: 3.4.0
 Hadoop Flags: Reviewed
   Resolution: Fixed

> Show decommission duration in JMX and HTML
> --
>
> Key: HDFS-17112
> URL: https://issues.apache.org/jira/browse/HDFS-17112
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Shuyan Zhang
>Assignee: Shuyan Zhang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>
> Expose decommission duration time in JMX page. It's a very useful info when 
> decommissioning a batch of datanodes in a cluster.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Created] (HDFS-17090) Decommission will be stuck for long time when restart because overlapped process Register and BlockReport.

2023-07-16 Thread Xiaoqiao He (Jira)
Xiaoqiao He created HDFS-17090:
--

 Summary: Decommission will be stuck for long time when restart 
because overlapped process Register and BlockReport.
 Key: HDFS-17090
 URL: https://issues.apache.org/jira/browse/HDFS-17090
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: namenode
Reporter: Xiaoqiao He
Assignee: Xiaoqiao He


I met one corner case recently where decommissioning DataNodes impacts the 
performance of the NameNode. After digging carefully, I have reproduced it.
a. Add some DataNodes to the exclude file and prepare to decommission them.
b. Execute bin/hdfs dfsadmin -refresh (this step is optional).
c. Restart the NameNode for an upgrade or another reason before the 
decommission completes.
d. All DataNodes are triggered to register and send FBRs.
e. The load on the NameNode becomes very high; in particular, the 8040 
CallQueue stays full for a long time because of the RPC flood of 
register/heartbeat/FBR from the DataNodes.
f. A node whose decommission is in progress will not finish decommissioning 
until the next FBR, even when all of its replicas have been processed. The 
request order is register-heartbeat-(blockreport, register); the second 
register could be a retried RPC request from the DataNode (no DataNode log 
information to confirm this), and for (blockreport, register) the NameNode 
may process one storage, then the register, then the remaining storages in 
order. 
g. Because of the second register RPC, the related DataNodes are marked 
unhealthy by BlockManager#isNodeHealthyForDecommissionOrMaintenance, so the 
decommission is stuck for a long time until the next FBR. The NameNode then 
has to scan this DataNode in every round to check whether it can complete, 
which holds the global write lock and hurts NameNode performance.

To improve this, I think we could filter the repeated register RPC requests 
during startup. I have not thought through whether filtering the register 
directly involves other risks. Any further discussion is welcome.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-17033) Update fsck to display stale state info of blocks accurately

2023-07-10 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He resolved HDFS-17033.

Fix Version/s: 3.4.0
 Hadoop Flags: Reviewed
   Resolution: Fixed

> Update fsck to display stale state info of blocks accurately
> 
>
> Key: HDFS-17033
> URL: https://issues.apache.org/jira/browse/HDFS-17033
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode, namanode
>Reporter: WangYuanben
>Assignee: WangYuanben
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>
> When a DN is stale, block replicas on that DN should be "STALE" instead of 
> "HEALTHY" in the fsck block check.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-17073) Enhance the warning message output for BlockGroupNonStripedChecksumComputer#compute

2023-07-09 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He resolved HDFS-17073.

Fix Version/s: 3.4.0
 Hadoop Flags: Reviewed
   Resolution: Fixed

> Enhance the warning message output for 
> BlockGroupNonStripedChecksumComputer#compute
> ---
>
> Key: HDFS-17073
> URL: https://issues.apache.org/jira/browse/HDFS-17073
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Haiyang Hu
>Assignee: Haiyang Hu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>
> Consider improving the log output of the warning messages generated by 
> BlockGroupNonStripedChecksumComputer when calling checksumBlock, to make it 
> easier to locate the block information where an exception occurs.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-17052) Improve BlockPlacementPolicyRackFaultTolerant to avoid choose nodes failed when no enough Rack.

2023-07-02 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He resolved HDFS-17052.

Fix Version/s: 3.4.0
 Hadoop Flags: Reviewed
   Resolution: Fixed

Committed to trunk. Marked as contributed by both [~zhtttylzz] and 
[~zhangshuyan]. Thanks.

> Improve BlockPlacementPolicyRackFaultTolerant to avoid choose nodes failed 
> when no enough Rack.
> ---
>
> Key: HDFS-17052
> URL: https://issues.apache.org/jira/browse/HDFS-17052
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namanode
>Affects Versions: 3.4.0
>Reporter: Hualong Zhang
>Assignee: Hualong Zhang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
> Attachments: failed reconstruction ec in same rack-1.png, write ec in 
> same rack.png
>
>
> When writing EC data, if the number of racks matching the storageType is 
> insufficient, more than one block is allowed to be written to the same rack.
> !write ec in same rack.png|width=962,height=604!
>  
>  
>  
> However, during EC block recovery, it is not possible to recover on the same 
> rack, which deviates from the expected behavior.
> !failed reconstruction ec in same rack-1.png|width=946,height=413!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-17044) Set size of non-exist block to NO_ACK when process FBR or IBR to avoid useless report from DataNode

2023-06-27 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He resolved HDFS-17044.

Fix Version/s: 3.4.0
 Hadoop Flags: Reviewed
   Resolution: Fixed

> Set size of non-exist block to NO_ACK when process FBR or IBR to avoid 
> useless report from DataNode
> ---
>
> Key: HDFS-17044
> URL: https://issues.apache.org/jira/browse/HDFS-17044
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Haiyang Hu
>Assignee: Haiyang Hu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>
> When the NameNode processes a DataNode incremental or full block report and 
> the block is not in the blocks map, it is added to the invalidates set so 
> the replica is removed from the datanode. The block size should be set to 
> NO_ACK, which avoids some useless DataNode block reports.
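> A minimal sketch of the intent (helper names hypothetical): mark the 
> unknown replica with NO_ACK so the datanode deletes it without sending a 
> deletion report back.
> {code:java}
> // Hypothetical: the reported block is absent from the blocks map, so
> // schedule its removal and suppress the follow-up deletion report.
> Block toInvalidate = new Block(reportedBlock);
> toInvalidate.setNumBytes(BlockCommand.NO_ACK);
> addToInvalidates(toInvalidate, dn);
> {code}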



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-17049) EC: Fix duplicate block group IDs generated by SequentialBlockGroupIdGenerator

2023-06-20 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He resolved HDFS-17049.

Resolution: Not A Problem

> EC: Fix duplicate block group IDs generated by SequentialBlockGroupIdGenerator
> --
>
> Key: HDFS-17049
> URL: https://issues.apache.org/jira/browse/HDFS-17049
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Shuyan Zhang
>Assignee: Shuyan Zhang
>Priority: Major
>  Labels: pull-request-available
>
> When I used multiple clients to write EC files concurrently, I found that 
> NameNode generated the same block group ID for different files:
> ```
> 2023-06-13 20:09:59,514 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* 
> allocate blk_-9223372036854697568_14389 for /ec-test/10/4068034329705654124
> 2023-06-13 20:09:59,514 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* 
> allocate blk_-9223372036854697568_14390 for /ec-test/19/7042966144171770731
> ```
> After diving into `SequentialBlockGroupIdGenerator`, I found that the current 
> implementation of `nextValue` is not thread-safe.
> This problem must be fixed.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-17037) Consider nonDfsUsed when running balancer

2023-06-09 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He resolved HDFS-17037.

Fix Version/s: 3.4.0
 Hadoop Flags: Reviewed
   Resolution: Fixed

> Consider nonDfsUsed when running balancer
> -
>
> Key: HDFS-17037
> URL: https://issues.apache.org/jira/browse/HDFS-17037
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Shuyan Zhang
>Assignee: Shuyan Zhang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>
> When we run the balancer with the `BalancingPolicy.Node` policy, our goal is 
> to balance the storage of each datanode. But in the current implementation, 
> the balancer doesn't account for the storage used by non-dfs data on the 
> datanodes, which can make the situation worse for datanodes that are already 
> strained on storage.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-17003) Erasure Coding: invalidate wrong block after reporting bad blocks from datanode

2023-06-08 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He resolved HDFS-17003.

Fix Version/s: 3.4.0
 Hadoop Flags: Reviewed
   Resolution: Fixed

> Erasure Coding: invalidate wrong block after reporting bad blocks from 
> datanode
> ---
>
> Key: HDFS-17003
> URL: https://issues.apache.org/jira/browse/HDFS-17003
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: farmmamba
>Assignee: farmmamba
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>
> After receiving a reportBadBlocks RPC from a datanode, the NameNode computes 
> the wrong block to invalidate. This is dangerous behaviour and may cause 
> data loss. Some logs from our production cluster are below:
>  
> NameNode log:
> {code:java}
> 2023-05-08 21:23:49,112 INFO org.apache.hadoop.hdfs.StateChange: *DIR* 
> reportBadBlocks for block: 
> BP-932824627--1680179358678:blk_-9223372036848404320_1471186 on datanode: 
> datanode1:50010
> 2023-05-08 21:23:49,183 INFO org.apache.hadoop.hdfs.StateChange: *DIR* 
> reportBadBlocks for block: 
> BP-932824627--1680179358678:blk_-9223372036848404319_1471186 on datanode: 
> datanode2:50010{code}
> datanode1 log:
> {code:java}
> 2023-05-08 21:23:49,088 WARN 
> org.apache.hadoop.hdfs.server.datanode.VolumeScanner: Reporting bad 
> BP-932824627--1680179358678:blk_-9223372036848404320_1471186 on 
> /data7/hadoop/hdfs/datanode
> 2023-05-08 21:24:00,509 INFO 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Failed 
> to delete replica blk_-9223372036848404319_1471186: ReplicaInfo not 
> found.{code}
>  
> This phenomenon can be reproduced.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-17026) RBF: NamenodeHeartbeatService should update JMX report with configurable frequency

2023-05-31 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He resolved HDFS-17026.

Fix Version/s: 3.4.0
 Hadoop Flags: Reviewed
   Resolution: Fixed

> RBF: NamenodeHeartbeatService should update JMX report with configurable 
> frequency
> --
>
> Key: HDFS-17026
> URL: https://issues.apache.org/jira/browse/HDFS-17026
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: rbf
>Reporter: Hector Sandoval Chaverri
>Assignee: Hector Sandoval Chaverri
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
> Attachments: HDFS-17026-branch-3.3.patch
>
>
> The NamenodeHeartbeatService currently calls each Namenode's JMX 
> endpoint every time it wakes up (default value is every 5 seconds).
> In a cluster with 40 routers, we have observed service degradation on some 
> of the Namenodes, since the JMX request obtains Datanode status and blocks 
> other RPC requests. However, the JMX report data doesn't seem to be used on 
> any critical path in the routers.
> We should configure the NamenodeHeartbeatService so it updates the JMX 
> reports at a slower frequency than the Namenode states, or allow disabling 
> the reports completely.
> The class calls out the JMX request being optional even though there is no 
> implementation to turn it off:
> {noformat}
> // Read the stats from JMX (optional)
> updateJMXParameters(webAddress, report);{noformat}
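> A minimal sketch of the intended behavior (field names hypothetical):
> {code:java}
> // Hypothetical: fetch JMX on its own, slower interval; namenode state is
> // still refreshed on every wake-up.
> long now = Time.monotonicNow();
> if (now - lastJmxUpdate >= jmxUpdateIntervalMs) {
>   updateJMXParameters(webAddress, report);
>   lastJmxUpdate = now;
> }
> {code}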



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-16985) Fix data missing issue when delete local block file.

2023-05-14 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He resolved HDFS-16985.

   Fix Version/s: 3.4.0
Hadoop Flags: Reviewed
Target Version/s:   (was: 3.3.1)
  Resolution: Fixed

> Fix data missing issue when delete local block file.
> 
>
> Key: HDFS-16985
> URL: https://issues.apache.org/jira/browse/HDFS-16985
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Reporter: Chengwei Wang
>Assignee: Chengwei Wang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>
> We encountered several missing-block problems in our production cluster, 
> which runs HDFS on AWS EC2 + EBS.
> The root cause:
>  # the block has only 1 replica left and has not been reconstructed
>  # the DN checks that the block file exists during BlockSender construction
>  # the EBS check fails and throws FileNotFoundException (EBS may be in a 
> fault condition)
>  # the DN invalidates the block and schedules async block deletion
>  # EBS is already back to normal when the DN performs the deletion
>  # the block file is deleted permanently and cannot be recovered



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-16999) Fix wrong use of processFirstBlockReport()

2023-05-08 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He resolved HDFS-16999.

Fix Version/s: 3.4.0
 Hadoop Flags: Reviewed
   Resolution: Fixed

> Fix wrong use of processFirstBlockReport()
> --
>
> Key: HDFS-16999
> URL: https://issues.apache.org/jira/browse/HDFS-16999
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Shuyan Zhang
>Assignee: Shuyan Zhang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>
> `processFirstBlockReport()` is used to process the first block report from a 
> datanode. It does not calculate the `toRemove` list because it assumes the 
> namenode holds no metadata about that datanode. However, if a datanode 
> re-registers after restarting, its `blockReportCount` is reset to 0, so the 
> first block report after a datanode restart is also processed by 
> `processFirstBlockReport()`. This is unreasonable because the datanode's 
> metadata already exists in the namenode at that point; if redundant replica 
> metadata is not removed in time, blocks with insufficient replicas cannot be 
> reconstructed in time, which increases the risk of missing blocks. In 
> summary, `processFirstBlockReport()` should only be used when the namenode 
> restarts, not when a datanode restarts. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-16997) Set the locale to avoid printing useless logs in BlockSender

2023-05-02 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He resolved HDFS-16997.

Fix Version/s: 3.4.0
 Hadoop Flags: Reviewed
   Resolution: Fixed

> Set the locale to avoid printing useless logs in BlockSender
> 
>
> Key: HDFS-16997
> URL: https://issues.apache.org/jira/browse/HDFS-16997
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Shuyan Zhang
>Assignee: Shuyan Zhang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>
> In our production environment, if the hadoop process is started in a 
> non-English locale, many unexpected error logs are printed. The following is 
> the error output of a datanode ("断开的管道" is Chinese for "Broken pipe").
> ```
> 2023-05-01 09:10:50,299 ERROR 
> org.apache.hadoop.hdfs.server.datanode.FileIoProvider: error in op 
> transferToSocketFully : 断开的管道
> 2023-05-01 09:10:50,299 ERROR 
> org.apache.hadoop.hdfs.server.datanode.DataNode: BlockSender.sendChunks() 
> exception: 
> java.io.IOException: 断开的管道
> at sun.nio.ch.FileChannelImpl.transferTo0(Native Method)
> at 
> sun.nio.ch.FileChannelImpl.transferToDirectlyInternal(FileChannelImpl.java:428)
> at 
> sun.nio.ch.FileChannelImpl.transferToDirectly(FileChannelImpl.java:493)
> at sun.nio.ch.FileChannelImpl.transferTo(FileChannelImpl.java:608)
> at 
> org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:242)
> at 
> org.apache.hadoop.hdfs.server.datanode.FileIoProvider.transferToSocketFully(FileIoProvider.java:260)
> at 
> org.apache.hadoop.hdfs.server.datanode.BlockSender.sendPacket(BlockSender.java:559)
> at 
> org.apache.hadoop.hdfs.server.datanode.BlockSender.doSendBlock(BlockSender.java:801)
> at 
> org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:755)
> at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:580)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opReadBlock(Receiver.java:116)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:71)
> at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:258)
> at java.lang.Thread.run(Thread.java:745)
> 2023-05-01 09:10:50,298 ERROR 
> org.apache.hadoop.hdfs.server.datanode.FileIoProvider: error in op 
> transferToSocketFully : 断开的管道
> 2023-05-01 09:10:50,298 ERROR 
> org.apache.hadoop.hdfs.server.datanode.FileIoProvider: error in op 
> transferToSocketFully : 断开的管道
> 2023-05-01 09:10:50,298 ERROR 
> org.apache.hadoop.hdfs.server.datanode.FileIoProvider: error in op 
> transferToSocketFully : 断开的管道
> 2023-05-01 09:10:50,298 ERROR 
> org.apache.hadoop.hdfs.server.datanode.FileIoProvider: error in op 
> transferToSocketFully : 断开的管道
> 2023-05-01 09:10:50,302 ERROR 
> org.apache.hadoop.hdfs.server.datanode.DataNode: BlockSender.sendChunks() 
> exception: 
> java.io.IOException: 断开的管道
> at sun.nio.ch.FileChannelImpl.transferTo0(Native Method)
> at 
> sun.nio.ch.FileChannelImpl.transferToDirectlyInternal(FileChannelImpl.java:428)
> at 
> sun.nio.ch.FileChannelImpl.transferToDirectly(FileChannelImpl.java:493)
> at sun.nio.ch.FileChannelImpl.transferTo(FileChannelImpl.java:608)
> at 
> org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:242)
> at 
> org.apache.hadoop.hdfs.server.datanode.FileIoProvider.transferToSocketFully(FileIoProvider.java:260)
> at 
> org.apache.hadoop.hdfs.server.datanode.BlockSender.sendPacket(BlockSender.java:559)
> at 
> org.apache.hadoop.hdfs.server.datanode.BlockSender.doSendBlock(BlockSender.java:801)
> at 
> org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:755)
> at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:580)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opReadBlock(Receiver.java:116)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:71)
> at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:258)
> at java.lang.Thread.run(Thread.java:745)
> 2023-05-01 09:10:50,303 ERROR 
> org.apache.hadoop.hdfs.server.datanode.FileIoProvider: error in op 
> transferToSocketFully : 断开的管道
> 2023-05-01 09:10:50,303 ERROR 
> org.apache.hadoop.hdfs.server.datanode.DataNode: BlockSender.sendChunks() 
> exception: 
> java.io.IOException: 断开的管道
> at sun.nio.ch.FileChannelImpl.transferTo0(Native Method)
> at 
> sun.nio.ch.FileChannelImpl.transferToDirectlyInternal(FileChannelImpl.java:428)
> at 
> 

[jira] [Resolved] (HDFS-16986) EC: Fix locationBudget in getListing()

2023-04-24 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He resolved HDFS-16986.

Fix Version/s: 3.4.0
 Hadoop Flags: Reviewed
   Resolution: Fixed

> EC: Fix locationBudget in getListing()
> --
>
> Key: HDFS-16986
> URL: https://issues.apache.org/jira/browse/HDFS-16986
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Shuyan Zhang
>Assignee: Shuyan Zhang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>
> The current `locationBudget` is estimated using the `block_replication` in 
> `FileStatus`, which is unreasonable for EC files, because it counts the 
> number of locations of an EC block as 1. We should consider the 
> ErasureCodingPolicy of the files to keep the meaning of `locationBudget` 
> consistent.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-16974) Consider volumes average load of each DataNode when choosing target.

2023-04-13 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He resolved HDFS-16974.

Fix Version/s: 3.4.0
 Hadoop Flags: Reviewed
   Resolution: Fixed

> Consider volumes average load of each DataNode when choosing target.
> 
>
> Key: HDFS-16974
> URL: https://issues.apache.org/jira/browse/HDFS-16974
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Shuyan Zhang
>Assignee: Shuyan Zhang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>
> The current target choosing policy only considers the load of the entire 
> datanode. If both DN1 and DN2 have an `xceiverCount` of 100, but DN1 has 10 
> volumes to write to and DN2 only has 1, then the pressure on DN2 is actually 
> much greater than that on DN1. This patch has added a configuration that 
> allows us to avoid nodes with too much pressure on a single volume when 
> choosing targets, so as to avoid overloading datanodes with few volumes or 
> slowing down writes.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-16964) Improve processing of excess redundancy after failover

2023-03-28 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He resolved HDFS-16964.

Fix Version/s: 3.4.0
 Hadoop Flags: Reviewed
   Resolution: Fixed

> Improve processing of excess redundancy after failover
> --
>
> Key: HDFS-16964
> URL: https://issues.apache.org/jira/browse/HDFS-16964
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Shuyan Zhang
>Assignee: Shuyan Zhang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>
> After failover, the block with excess redundancy cannot be processed until 
> all replicas are not stale, because the stale ones may have been deleted. 
> That is to say, we need to wait for the FBRs of all datanodes on which the 
> block resides before deleting the redundant replicas. This is unnecessary, we 
> can bypass stale replicas when dealing with excess replicas, and delete 
> non-stale excess replicas in a more timely manner.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-16948) Update log of BlockManager#chooseExcessRedundancyStriped when EC internal block is moved by balancer

2023-03-23 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He resolved HDFS-16948.

Fix Version/s: 3.4.0
 Hadoop Flags: Reviewed
   Resolution: Fixed

> Update log of BlockManager#chooseExcessRedundancyStriped when EC internal 
> block is moved by balancer
> 
>
> Key: HDFS-16948
> URL: https://issues.apache.org/jira/browse/HDFS-16948
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Affects Versions: 3.3.4
>Reporter: Kidd5368
>Assignee: Kidd5368
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>
> This is a follow-up of HDFS-16179. When an EC internal block is moved by the 
> balancer, it triggers chooseExcessRedundancyStriped() with a non-null 
> delNodeHint parameter. The nonExcess list is then modified in 
> processChosenExcessRedundancy(); normally the size of nonExcess is reduced 
> to the expected EC group size, which is why the annoying log "excess types 
> chose ... is empty." is printed.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-16939) Fix the thread safety bug in LowRedundancyBlocks

2023-03-06 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He resolved HDFS-16939.

Fix Version/s: 3.4.0
 Hadoop Flags: Reviewed
   Resolution: Fixed

> Fix the thread safety bug in LowRedundancyBlocks
> 
>
> Key: HDFS-16939
> URL: https://issues.apache.org/jira/browse/HDFS-16939
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namanode
>Reporter: Shuyan Zhang
>Assignee: Shuyan Zhang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>
> The remove method in LowRedundancyBlocks is not protected by synchronized. 
> This method is private and is called by BlockManager. As a result, 
> priorityQueues has the risk of being accessed concurrently by multiple 
> threads.
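> A minimal sketch of the fix shape (signature illustrative): protect the 
> private remove path with the same monitor as the public mutators.
> {code:java}
> // Hypothetical: synchronized like the other methods that mutate
> // priorityQueues, so concurrent callers cannot corrupt the queues.
> synchronized boolean remove(BlockInfo block, int priLevel) {
>   return priorityQueues.get(priLevel).remove(block);
> }
> {code}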



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-16898) Remove write lock for processCommandFromActor of DataNode to reduce impact on heartbeat

2023-02-07 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He resolved HDFS-16898.

Fix Version/s: 3.4.0
 Hadoop Flags: Reviewed
   Resolution: Fixed

> Remove write lock for processCommandFromActor of DataNode to reduce impact on 
> heartbeat
> ---
>
> Key: HDFS-16898
> URL: https://issues.apache.org/jira/browse/HDFS-16898
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Affects Versions: 3.3.4
>Reporter: ZhangHB
>Assignee: ZhangHB
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>
> Now in method processCommandFromActor,  we have code like below:
>  
> {code:java}
> writeLock();
> try {
>   if (actor == bpServiceToActive) {
>     return processCommandFromActive(cmd, actor);
>   } else {
>     return processCommandFromStandby(cmd, actor);
>   }
> } finally {
>   writeUnlock();
> } {code}
> If processCommandFromActive takes a long time, the write lock is not 
> released.
>  
> This can block the updateActorStatesFromHeartbeat method in offerService; 
> furthermore, it can drive the lastContact of the datanode very high, and the 
> datanode can even be marked dead when lastContact exceeds 600s.
> {code:java}
> bpos.updateActorStatesFromHeartbeat(
>     this, resp.getNameNodeHaState());{code}
> Here we can make the write lock fine-grained in the processCommandFromActor 
> method to address this problem.
>  
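> A minimal sketch of the direction (illustrative, not the committed code): 
> read bpServiceToActive under a brief lock and process the command outside 
> it.
> {code:java}
> // Hypothetical: snapshot the active-actor decision under a short lock
> // instead of holding the write lock across the whole command processing.
> boolean fromActive;
> readLock();
> try {
>   fromActive = (actor == bpServiceToActive);
> } finally {
>   readUnlock();
> }
> return fromActive
>     ? processCommandFromActive(cmd, actor)
>     : processCommandFromStandby(cmd, actor);
> {code}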



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-16868) Fix audit log duplicate issue when an ACE occurs in FSNamesystem.

2022-12-12 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He resolved HDFS-16868.

Fix Version/s: 3.4.0
 Hadoop Flags: Reviewed
   Resolution: Fixed

> Fix audit log duplicate issue when an ACE occurs in FSNamesystem.
> -
>
> Key: HDFS-16868
> URL: https://issues.apache.org/jira/browse/HDFS-16868
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Beibei Zhao
>Assignee: Beibei Zhao
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>
> checkSuperuserPrivilege calls logAuditEvent and rethrows the ace when an 
> AccessControlException occurs.
> {code:java}
>   // This method logs operationName without super user privilege.
>   // It should be called without holding FSN lock.
>   void checkSuperuserPrivilege(String operationName, String path)
>       throws IOException {
>     if (isPermissionEnabled) {
>       try {
>         FSPermissionChecker.setOperationType(operationName);
>         FSPermissionChecker pc = getPermissionChecker();
>         pc.checkSuperuserPrivilege(path);
>       } catch (AccessControlException ace) {
>         logAuditEvent(false, operationName, path);
>         throw ace;
>       }
>     }
>   }
> {code}
> Its callers, such as metaSave, call it like this: 
> {code:java}
>   /**
>    * Dump all metadata into specified file
>    * @param filename
>    */
>   void metaSave(String filename) throws IOException {
>     String operationName = "metaSave";
>     checkSuperuserPrivilege(operationName);
>     ..
>     try {
>       ..
>       metaSave(out);
>       ..
>     } finally {
>       readUnlock(operationName, getLockReportInfoSupplier(null));
>     }
>     logAuditEvent(true, operationName, null);
>   }
> {code}
> but setQuota, addCachePool, modifyCachePool, removeCachePool, 
> createEncryptionZone and reencryptEncryptionZone catch the ace and log the 
> same message again, which I think is a waste of memory: 
> {code:java}
>   /**
>    * Set the namespace quota and storage space quota for a directory.
>    * See {@link ClientProtocol#setQuota(String, long, long, StorageType)}
>    * for the contract.
>    *
>    * Note: This does not support ".inodes" relative path.
>    */
>   void setQuota(String src, long nsQuota, long ssQuota, StorageType type)
>       throws IOException {
>     ..
>     try {
>       if (!allowOwnerSetQuota) {
>         checkSuperuserPrivilege(operationName, src);
>       }
>       ..
>     } catch (AccessControlException ace) {
>       logAuditEvent(false, operationName, src);
>       throw ace;
>     }
>     getEditLog().logSync();
>     logAuditEvent(true, operationName, src);
>   }
> {code}
> Maybe we should move the checkSuperuserPrivilege out of the try block as 
> metaSave and other callers do.
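> A minimal sketch of the suggested reordering (illustrative):
> {code:java}
> // Hypothetical: run the privilege check before the try block, as metaSave
> // does, so its internal logAuditEvent fires exactly once and the catch
> // below no longer re-logs the same ACE.
> if (!allowOwnerSetQuota) {
>   checkSuperuserPrivilege(operationName, src);
> }
> try {
>   // ... perform the quota update ...
> } catch (AccessControlException ace) {
>   logAuditEvent(false, operationName, src);
>   throw ace;
> }
> {code}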



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-16785) Avoid to hold write lock to improve performance when add volume.

2022-11-13 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He resolved HDFS-16785.

Fix Version/s: 3.4.0
 Hadoop Flags: Reviewed
   Resolution: Fixed

Committed to trunk. Thanks [~xuzq_zander] for your works.

> Avoid to hold write lock to improve performance when add volume. 
> -
>
> Key: HDFS-16785
> URL: https://issues.apache.org/jira/browse/HDFS-16785
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: ZanderXu
>Assignee: ZanderXu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>
> While patching the fine-grained locking of the datanode, I found that 
> `addVolume` holds the write lock of the BP lock while scanning the new 
> volume to collect its blocks. If we try to add a full volume that was 
> previously fixed offline, it will hold the write lock for a long time.
> The related code is as below:
> {code:java}
> for (final NamespaceInfo nsInfo : nsInfos) {
>   String bpid = nsInfo.getBlockPoolID();
>   try (AutoCloseDataSetLock l =
>       lockManager.writeLock(LockLevel.BLOCK_POOl, bpid)) {
>     fsVolume.addBlockPool(bpid, this.conf, this.timer);
>     fsVolume.getVolumeMap(bpid, tempVolumeMap, ramDiskReplicaTracker);
>   } catch (IOException e) {
>     LOG.warn("Caught exception when adding " + fsVolume +
>         ". Will throw later.", e);
>     exceptions.add(e);
>   }
> } {code}
> And I noticed that this lock was added by HDFS-15382, meaning this logic was 
> not under the lock before. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-16787) Remove the redundant lock in DataSetLockManager#removeLock.

2022-10-10 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He resolved HDFS-16787.

Fix Version/s: 3.4.0
 Hadoop Flags: Reviewed
   Resolution: Fixed

> Remove the redundant lock in DataSetLockManager#removeLock.
> ---
>
> Key: HDFS-16787
> URL: https://issues.apache.org/jira/browse/HDFS-16787
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: ZanderXu
>Assignee: ZanderXu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>
> While patching the datanode fine-grained locking, I found a redundant lock in
> DataSetLockManager#removeLock; the code is as below:
> {code:java}
> @Override
> public void removeLock(LockLevel level, String... resources) {
>   String lockName = generateLockName(level, resources);
>   try (AutoCloseDataSetLock lock = writeLock(level, resources)) {
> // Here, this lock is redundant.
> lock.lock();
> lockMap.removeLock(lockName);
>   }
> } {code}
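> The fix is simply to drop the inner lock() call, since the writeLock() in the
> try-with-resources already acquires the lock (sketch):
> {code:java}
> @Override
> public void removeLock(LockLevel level, String... resources) {
>   String lockName = generateLockName(level, resources);
>   try (AutoCloseDataSetLock lock = writeLock(level, resources)) {
>     lockMap.removeLock(lockName);
>   }
> }
> {code}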



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-16783) Remove the redundant lock in deepCopyReplica and getFinalizedBlocks

2022-10-10 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He resolved HDFS-16783.

Fix Version/s: 3.4.0
 Hadoop Flags: Reviewed
   Resolution: Fixed

> Remove the redundant lock in deepCopyReplica and getFinalizedBlocks
> ---
>
> Key: HDFS-16783
> URL: https://issues.apache.org/jira/browse/HDFS-16783
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: ZanderXu
>Assignee: ZanderXu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>
> While patching the fine-grained locking of the datanode, I found a redundant
> lock in deepCopyReplica; maybe we can remove it.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-16798) SerialNumberMap should decrease current counter if the item exist

2022-10-09 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He resolved HDFS-16798.

Fix Version/s: 3.4.0
 Hadoop Flags: Reviewed
   Resolution: Fixed

> SerialNumberMap should decrease current counter if the item exist
> -
>
> Key: HDFS-16798
> URL: https://issues.apache.org/jira/browse/HDFS-16798
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: ZanderXu
>Assignee: ZanderXu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>
> While looking into some XATTR-related code, I found a bug in SerialNumberMap,
> as below:
> {code:java}
> public int get(T t) {
>   if (t == null) {
> return 0;
>   }
>   Integer sn = t2i.get(t);
>   if (sn == null) {
> sn = current.getAndIncrement();
> if (sn > max) {
>   current.getAndDecrement();
>   throw new IllegalStateException(name + ": serial number map is full");
> }
> Integer old = t2i.putIfAbsent(t, sn);
> if (old != null) {
>   // here: if the old is not null, we should decrease the current value.
>   return old;
> }
> i2t.put(sn, t);
>   }
>   return sn;
> } {code}
> This bug only causes the effective capacity of SerialNumberMap to be less
> than expected; there is no other impact.
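> A minimal sketch of the fix implied by the comment above: give the reserved
> serial number back when another thread won the putIfAbsent race.
> {code:java}
> Integer old = t2i.putIfAbsent(t, sn);
> if (old != null) {
>   current.getAndDecrement(); // return the unused serial number to the counter
>   return old;
> }
> i2t.put(sn, t);
> {code}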



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-16781) Can the setReplication method set replicas for a whole folder?

2022-09-25 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He resolved HDFS-16781.

Resolution: Invalid

[~zdltvxq] Thanks for your report. IIRC, the setReplication API has behaved
this way for a long time, and there is no interface for recursive setrep
(including in the latest release version). IMO this prevents overloading the
NameNode with many files in a single request. On the shell side, `hadoop fs
-setrep` also lists all files and calls setReplication one by one, as sketched
below.
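For reference, the shell's per-file recursion can be reproduced on the client
side with public FileSystem APIs (a minimal sketch, not an official recursive
API):
{code:java}
import java.io.IOException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;

static void setRepRecursively(FileSystem fs, Path dir, short replication)
    throws IOException {
  // listFiles(dir, true) walks the directory tree recursively.
  RemoteIterator<LocatedFileStatus> files = fs.listFiles(dir, true);
  while (files.hasNext()) {
    LocatedFileStatus status = files.next();
    if (status.isFile()) {
      fs.setReplication(status.getPath(), replication); // one RPC per file
    }
  }
}
{code}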

BTW, two recommendations: a) please use English when filing a new ticket;
b) please first send mail to u...@hadoop.apache.org or user...@hadoop.apache.org
(the latter is for the Chinese language) when you meet any issue. Generally,
tickets here are for bugfix/feature/improvement work by devs.

Thanks again. 

> Can the setReplication method set replicas for a whole folder?
> --
>
> Key: HDFS-16781
> URL: https://issues.apache.org/jira/browse/HDFS-16781
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: dfs
>Affects Versions: 2.6.0
> Environment: org.apache.hadoop.fs.FileSystem
>Reporter: zdl
>Priority: Blocker
>
> org.apache.hadoop.fs.FileSystem has a setReplication method that sets the
> replication factor of a target file, but it can only be applied to a file; if
> a directory is given, it has no effect.
> I would like replication to be set for all files under a directory (including
> files in subdirectories) when a directory is given. On the command line,
> hadoop fs -setrep /dirs achieves exactly this, so why can't the Java API do
> the same?
> Or has this problem already been solved in some newer version? Please advise,
> thanks.
> In short: I want to set replication for a whole folder's files via
> org.apache.hadoop.fs.FileSystem#setReplication, while in v2.6.0 it can only be
> set on a single file. Is there any solution, or a later version that solves
> it?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-16593) Correct inaccurate BlocksRemoved metric on DataNode side

2022-09-05 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He resolved HDFS-16593.

Fix Version/s: 3.4.0
 Hadoop Flags: Reviewed
   Resolution: Fixed

Committed to trunk.

> Correct inaccurate BlocksRemoved metric on DataNode side
> 
>
> Key: HDFS-16593
> URL: https://issues.apache.org/jira/browse/HDFS-16593
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: ZanderXu
>Assignee: ZanderXu
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> When tracing the root cause of a production issue, I found that the
> BlocksRemoved metric on the Datanode side was inaccurate.
> {code:java}
> case DatanodeProtocol.DNA_INVALIDATE:
>   //
>   // Some local block(s) are obsolete and can be 
>   // safely garbage-collected.
>   //
>   Block toDelete[] = bcmd.getBlocks();
>   try {
> // using global fsdataset
> dn.getFSDataset().invalidate(bcmd.getBlockPoolId(), toDelete);
>   } catch(IOException e) {
> // Exceptions caught here are not expected to be disk-related.
> throw e;
>   }
>   dn.metrics.incrBlocksRemoved(toDelete.length);
>   break;
> {code}
> Because even if the invalidate method throws an exception, some blocks may 
> have been successfully deleted internally.
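> One way to keep the metric accurate, as a sketch only (not necessarily the
> committed fix): account per block instead of assuming the whole batch
> succeeded. Invalidating one block at a time trades away batching, so this is
> purely an illustration.
> {code:java}
> case DatanodeProtocol.DNA_INVALIDATE:
>   Block toDelete[] = bcmd.getBlocks();
>   int removed = 0;
>   try {
>     for (Block b : toDelete) {
>       dn.getFSDataset().invalidate(bcmd.getBlockPoolId(), new Block[] { b });
>       removed++; // counted only after a successful deletion
>     }
>   } finally {
>     dn.metrics.incrBlocksRemoved(removed); // accurate even if we threw mid-way
>   }
>   break;
> {code}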



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-16735) Reduce the number of HeartbeatManager loops

2022-08-28 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16735?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He resolved HDFS-16735.

Fix Version/s: 3.4.0
 Hadoop Flags: Reviewed
   Resolution: Fixed

Committed to trunk. Thanks [~zhangshuyan] for your contributions!

> Reduce the number of HeartbeatManager loops
> ---
>
> Key: HDFS-16735
> URL: https://issues.apache.org/jira/browse/HDFS-16735
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Shuyan Zhang
>Assignee: Shuyan Zhang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>
> HeartbeatManager only processes one dead datanode (and failed storage) per 
> round in heartbeatCheck(), that is to say, if there are ten failed storages, 
> all datanode states need to be scanned 10 times, which is unnecessary and a 
> waste of resources. This patch makes the number of bad storages processed per 
> scan configurable.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-16717) Replace NPE with IOException in DataNode.class

2022-08-23 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He resolved HDFS-16717.

Fix Version/s: 3.4.0
 Hadoop Flags: Reviewed
   Resolution: Fixed

Committed to trunk. Thanks [~xuzq_zander]!

> Replace NPE with IOException in DataNode.class
> --
>
> Key: HDFS-16717
> URL: https://issues.apache.org/jira/browse/HDFS-16717
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: ZanderXu
>Assignee: ZanderXu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> In the current logic, if the storage is not yet initialized, DataNode.class
> throws an NPE. Developers and SREs are very sensitive to NPEs, so I feel we
> can throw an IOException instead of an NPE when the storage is not yet
> initialized.
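> A minimal sketch of the substitution:
> {code:java}
> // instead of Preconditions.checkNotNull(data, "Storage not yet initialized"),
> // which surfaces as a NullPointerException:
> if (data == null) {
>   throw new IOException("Storage not yet initialized");
> }
> {code}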



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-16704) Datanode return empty response instead of NPE for GetVolumeInfo during restarting

2022-08-15 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He resolved HDFS-16704.

Fix Version/s: 3.4.0
 Hadoop Flags: Reviewed
   Resolution: Fixed

Committed to trunk. Thanks [~xuzq_zander].

> Datanode return empty response instead of NPE for GetVolumeInfo during 
> restarting
> -
>
> Key: HDFS-16704
> URL: https://issues.apache.org/jira/browse/HDFS-16704
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: ZanderXu
>Assignee: ZanderXu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> During datanode startup, I found some NPEs in the logs:
> {code:java}
> Caused by: java.lang.NullPointerException: Storage not yet initialized
>     at 
> org.apache.hadoop.thirdparty.com.google.common.base.Preconditions.checkNotNull(Preconditions.java:899)
>     at 
> org.apache.hadoop.hdfs.server.datanode.DataNode.getVolumeInfo(DataNode.java:3533)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>     at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:498)
>     at sun.reflect.misc.Trampoline.invoke(MethodUtil.java:72)
>     at sun.reflect.GeneratedMethodAccessor7.invoke(Unknown Source)
>     at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:498)
>     at sun.reflect.misc.MethodUtil.invoke(MethodUtil.java:276)
>     at 
> com.sun.jmx.mbeanserver.ConvertingMethod.invokeWithOpenReturn(ConvertingMethod.java:193)
>     at 
> com.sun.jmx.mbeanserver.ConvertingMethod.invokeWithOpenReturn(ConvertingMethod.java:175)
>  {code}
> This is because the datanode storage is not yet initialized when we try to
> get the datanode metrics; the related code is as below:
> {code:java}
> @Override // DataNodeMXBean
> public String getVolumeInfo() {
>   Preconditions.checkNotNull(data, "Storage not yet initialized");
>   return JSON.toString(data.getVolumeInfoMap());
> } {code}
> The check itself is fine, but I feel the more reasonable behavior is to
> return an empty response instead of an NPE, because the InfoServer starts
> before initBlockPool.
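> A sketch of the suggested behavior (assuming an empty map is an acceptable
> "empty response" for the MXBean):
> {code:java}
> @Override // DataNodeMXBean
> public String getVolumeInfo() {
>   if (data == null) {
>     LOG.debug("Storage not yet initialized.");
>     return JSON.toString(java.util.Collections.emptyMap());
>   }
>   return JSON.toString(data.getVolumeInfoMap());
> }
> {code}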



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-16658) BlockManager should output some logs when logEveryBlock is true.

2022-07-28 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He resolved HDFS-16658.

Fix Version/s: 3.4.0
 Hadoop Flags: Reviewed
   Resolution: Fixed

Committed to trunk. Thanks [~xuzq_zander].

> BlockManager should output some logs when logEveryBlock is true.
> 
>
> Key: HDFS-16658
> URL: https://issues.apache.org/jira/browse/HDFS-16658
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: ZanderXu
>Assignee: ZanderXu
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> While investigating some abnormal cases in our prod environment, I found that
> BlockManager does not output any log in `addStoredBlock` even though
> `logEveryBlock` is true.
> I feel we need to change the log level from DEBUG to INFO.
> {code:java}
> // Some comments here
> private Block addStoredBlock(final BlockInfo block,
>final Block reportedBlock,
>DatanodeStorageInfo storageInfo,
>DatanodeDescriptor delNodeHint,
>boolean logEveryBlock)
>   throws IOException {
> 
>   if (logEveryBlock) {
> blockLog.debug("BLOCK* addStoredBlock: {} is added to {} (size={})",
> node, storedBlock, storedBlock.getNumBytes());
>   }
> ...
>   }
> {code}
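> The proposed change is a one-line level bump (sketch):
> {code:java}
> if (logEveryBlock) {
>   blockLog.info("BLOCK* addStoredBlock: {} is added to {} (size={})",
>       node, storedBlock, storedBlock.getNumBytes());
> }
> {code}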



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-16655) OIV: print out erasure coding policy name in oiv Delimited output

2022-07-25 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He resolved HDFS-16655.

Fix Version/s: 3.4.0
 Hadoop Flags: Reviewed
   Resolution: Fixed

Committed to trunk. Thanks [~max2049] for your contributions.

> OIV: print out erasure coding policy name in oiv Delimited output
> -
>
> Key: HDFS-16655
> URL: https://issues.apache.org/jira/browse/HDFS-16655
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: tools
>Affects Versions: 3.4.0
>Reporter: Max  Xie
>Assignee: Max  Xie
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>  Time Spent: 2h
>  Remaining Estimate: 0h
>
> Adding the erasure coding policy name to the oiv output helps oiv
> post-analysis: it gives an overview of all folders/files with a specified EC
> policy and allows internal regulation to be applied based on this
> information. In particular, it will be convenient for the platform to
> calculate the real storage size of EC files.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-16600) Fix deadlock of fine-grain lock for FsDatastImpl of DataNode.

2022-06-17 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He resolved HDFS-16600.

Fix Version/s: 3.4.0
 Hadoop Flags: Reviewed
   Resolution: Fixed

Committed to trunk. Thanks [~xuzq_zander] for your work!

> Fix deadlock of fine-grain lock for FsDatastImpl of DataNode.
> -
>
> Key: HDFS-16600
> URL: https://issues.apache.org/jira/browse/HDFS-16600
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: ZanderXu
>Assignee: ZanderXu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>  Time Spent: 5h
>  Remaining Estimate: 0h
>
> The UT
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.testSynchronousEviction
> failed because of a deadlock introduced by
> [HDFS-16534|https://issues.apache.org/jira/browse/HDFS-16534].
> Deadlock:
> {code:java}
> // org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.createRbw line 1588 
> need a read lock
> try (AutoCloseableLock lock = lockManager.readLock(LockLevel.BLOCK_POOl,
> b.getBlockPoolId()))
> // org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.evictBlocks line 
> 3526 need a write lock
> try (AutoCloseableLock lock = lockManager.writeLock(LockLevel.BLOCK_POOl, 
> bpid))
> {code}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-16628) RBF: Correct target directory when move to trash for kerberos login user.

2022-06-15 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16628?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He resolved HDFS-16628.

Hadoop Flags: Reviewed
Target Version/s: 3.4.0
  Resolution: Fixed

Committed to trunk. Thanks [~zhangxiping].

> RBF: Correct target directory when move to trash for kerberos login user.
> -
>
> Key: HDFS-16628
> URL: https://issues.apache.org/jira/browse/HDFS-16628
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: rbf
>Affects Versions: 3.4.0
>Reporter: Xiping Zhang
>Assignee: Xiping Zhang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> Removing data through the router will fail for a kerberos login user such as
> username/d...@hadoop.com



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-16598) Fix DataNode FsDatasetImpl lock issue without GS checks.

2022-06-14 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He resolved HDFS-16598.

Fix Version/s: 3.4.0
 Hadoop Flags: Reviewed
   Resolution: Fixed

Committed to trunk. Thanks [~xuzq_zander] for your work!

> Fix DataNode FsDatasetImpl lock issue without GS checks.
> 
>
> Key: HDFS-16598
> URL: https://issues.apache.org/jira/browse/HDFS-16598
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: ZanderXu
>Assignee: ZanderXu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>  Time Spent: 4h 50m
>  Remaining Estimate: 0h
>
> org.apache.hadoop.hdfs.testPipelineRecoveryOnRestartFailure failed with a
> stack like:
> {code:java}
> java.io.IOException: All datanodes 
> [DatanodeInfoWithStorage[127.0.0.1:57448,DS-1b5f7e33-a2bf-4edc-9122-a74c995a99f5,DISK]]
>  are bad. Aborting...
>   at 
> org.apache.hadoop.hdfs.DataStreamer.handleBadDatanode(DataStreamer.java:1667)
>   at 
> org.apache.hadoop.hdfs.DataStreamer.setupPipelineInternal(DataStreamer.java:1601)
>   at 
> org.apache.hadoop.hdfs.DataStreamer.setupPipelineForAppendOrRecovery(DataStreamer.java:1587)
>   at 
> org.apache.hadoop.hdfs.DataStreamer.processDatanodeOrExternalError(DataStreamer.java:1371)
>   at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:674)
> {code}
> After tracing the root cause, I found this bug was introduced by
> [HDFS-16534|https://issues.apache.org/jira/browse/HDFS-16534], because the
> client's block GS may be smaller than the DN's when pipeline recovery fails.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-16609) Fix flaky JUnit tests that often report timeouts

2022-06-11 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He resolved HDFS-16609.

Hadoop Flags: Reviewed
  Resolution: Fixed

Committed to trunk. Thanks [~slfan1989].

> Fix flaky JUnit tests that often report timeouts
> -
>
> Key: HDFS-16609
> URL: https://issues.apache.org/jira/browse/HDFS-16609
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: fanshilun
>Assignee: fanshilun
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>  Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> While working on HDFS-16590, the JUnit tests often reported errors. I found
> that one class of failures is timeouts, and these can be avoided by
> increasing the timeout values.
> The modified methods are as follows:
> 1.org.apache.hadoop.hdfs.TestFileCreation#testServerDefaultsWithMinimalCaching
> {code:java}
> [ERROR] 
> testServerDefaultsWithMinimalCaching(org.apache.hadoop.hdfs.TestFileCreation) 
>  Time elapsed: 7.136 s  <<< ERROR!
> java.util.concurrent.TimeoutException: 
> Timed out waiting for condition. 
> Thread diagnostics: 
> [WARNING] 
> org.apache.hadoop.hdfs.TestFileCreation.testServerDefaultsWithMinimalCaching(org.apache.hadoop.hdfs.TestFileCreation)
> [ERROR]   Run 1: TestFileCreation.testServerDefaultsWithMinimalCaching:277 
> Timeout Timed out ...
> [INFO]   Run 2: PASS{code}
> 2.org.apache.hadoop.hdfs.TestDFSShell#testFilePermissions
> {code:java}
> [ERROR] testFilePermissions(org.apache.hadoop.hdfs.TestDFSShell)  Time 
> elapsed: 30.022 s  <<< ERROR!
> org.junit.runners.model.TestTimedOutException: test timed out after 3 
> milliseconds
>   at java.lang.Thread.dumpThreads(Native Method)
>   at java.lang.Thread.getStackTrace(Thread.java:1549)
>   at 
> org.junit.internal.runners.statements.FailOnTimeout.createTimeoutException(FailOnTimeout.java:182)
>   at 
> org.junit.internal.runners.statements.FailOnTimeout.getResult(FailOnTimeout.java:177)
>   at 
> org.junit.internal.runners.statements.FailOnTimeout.evaluate(FailOnTimeout.java:128)
>   at 
> org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:299)
>   at 
> org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:293)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at java.lang.Thread.run(Thread.java:748)
> [WARNING] 
> org.apache.hadoop.hdfs.TestDFSShell.testFilePermissions(org.apache.hadoop.hdfs.TestDFSShell)
> [ERROR]   Run 1: TestDFSShell.testFilePermissions TestTimedOut test timed out 
> after 3 mil...
> [INFO]   Run 2: PASS {code}
> 3.org.apache.hadoop.hdfs.server.sps.TestExternalStoragePolicySatisfier#testSPSWhenFileHasExcessRedundancyBlocks
> {code:java}
> [ERROR] 
> testSPSWhenFileHasExcessRedundancyBlocks(org.apache.hadoop.hdfs.server.sps.TestExternalStoragePolicySatisfier)
>   Time elapsed: 67.904 s  <<< ERROR!
> java.util.concurrent.TimeoutException: 
> Timed out waiting for condition. 
> [WARNING] 
> org.apache.hadoop.hdfs.server.sps.TestExternalStoragePolicySatisfier.testSPSWhenFileHasExcessRedundancyBlocks(org.apache.hadoop.hdfs.server.sps.TestExternalStoragePolicySatisfier)
> [ERROR]   Run 1: 
> TestExternalStoragePolicySatisfier.testSPSWhenFileHasExcessRedundancyBlocks:1379
>  Timeout
> [ERROR]   Run 2: 
> TestExternalStoragePolicySatisfier.testSPSWhenFileHasExcessRedundancyBlocks:1379
>  Timeout
> [INFO]   Run 3: PASS {code}
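> The fixes are of this shape (the values are illustrative, not necessarily the
> committed ones):
> {code:java}
> // before: @Test(timeout = 30000)
> @Test(timeout = 90000)
> public void testFilePermissions() throws Exception {
>   // test body unchanged
> }
> {code}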



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-16627) Improve BPServiceActor#register log to add NameNode address

2022-06-11 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He resolved HDFS-16627.

Hadoop Flags: Reviewed
  Resolution: Fixed

Committed to trunk. Thanks [~slfan1989] for your contributions.

> Improve BPServiceActor#register log to add NameNode address
> ---
>
> Key: HDFS-16627
> URL: https://issues.apache.org/jira/browse/HDFS-16627
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs
>Affects Versions: 3.4.0
>Reporter: fanshilun
>Assignee: fanshilun
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>  Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> When reading the log, I think the NN address information should be added to
> make the log information more complete.
> The log is as follows:
> {code:java}
> 2022-06-06 06:15:32,715 [BP-1990954485-172.17.0.2-1654496132136 heartbeating 
> to localhost/127.0.0.1:42811] INFO  datanode.DataNode 
> (BPServiceActor.java:register(819)) - Block pool 
> BP-1990954485-172.17.0.2-1654496132136 (Datanode Uuid 
> 7d4b5459-6f2b-4203-bf6f-d31bfb9b6c3f) service to localhost/127.0.0.1:42811 
> beginning handshake with NN.
> 2022-06-06 06:15:32,717 [BP-1990954485-172.17.0.2-1654496132136 heartbeating 
> to localhost/127.0.0.1:42811] INFO  datanode.DataNode 
> (BPServiceActor.java:register(847)) - Block pool 
> BP-1990954485-172.17.0.2-1654496132136 (Datanode Uuid 
> 7d4b5459-6f2b-4203-bf6f-d31bfb9b6c3f) service to localhost/127.0.0.1:42811 
> successfully registered with NN. {code}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-16588) Backport HDFS-16584 to branch-3.3.

2022-05-24 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He resolved HDFS-16588.

Fix Version/s: 3.3.4
 Hadoop Flags: Reviewed
   Resolution: Fixed

Committed to branch-3.3

> Backport HDFS-16584 to branch-3.3.
> --
>
> Key: HDFS-16588
> URL: https://issues.apache.org/jira/browse/HDFS-16588
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: balancer  mover, namenode
>Reporter: JiangHua Zhu
>Assignee: JiangHua Zhu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.3.4
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> This issue has been dealt with in trunk and now needs to be backported to
> branch-3.3 or other active branches.
> See HDFS-16584.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-16584) Record StandbyNameNode information when Balancer is running

2022-05-22 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He resolved HDFS-16584.

Fix Version/s: 3.4.0
 Hadoop Flags: Reviewed
   Resolution: Fixed

Committed to trunk. Thanks [~jianghuazhu] for your contribution!

> Record StandbyNameNode information when Balancer is running
> ---
>
> Key: HDFS-16584
> URL: https://issues.apache.org/jira/browse/HDFS-16584
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: balancer  mover, namenode
>Affects Versions: 3.3.0
>Reporter: JiangHua Zhu
>Assignee: JiangHua Zhu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
> Attachments: image-2022-05-19-20-23-23-825.png
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> When the Balancer is running, we allow block data to be fetched from the 
> StandbyNameNode, which is nice. Here are some logs:
>  !image-2022-05-19-20-23-23-825.png! 
> But we have no way of knowing which NameNode the request was made to. We 
> should log more detailed information, such as the host associated with the 
> StandbyNameNode.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-16553) Fix checkstyle for the length of BlockManager construction method over limit.

2022-04-29 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He resolved HDFS-16553.

   Fix Version/s: 3.4.0
Hadoop Flags: Reviewed
Target Version/s:   (was: 3.4.0)
  Resolution: Fixed

Committed to trunk. Thanks [~smarthan] for your contributions.

> Fix checkstyle for the length of BlockManager construction method over limit.
> -
>
> Key: HDFS-16553
> URL: https://issues.apache.org/jira/browse/HDFS-16553
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Chengwei Wang
>Assignee: Chengwei Wang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> The BlockManager constructor is 156 lines long, which is over the 150-line
> checkstyle limit; refactor the method to fix the checkstyle violation.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-16554) Remove unused configuration dfs.namenode.block.deletion.increment.

2022-04-26 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He resolved HDFS-16554.

   Fix Version/s: 3.4.0
Hadoop Flags: Reviewed
Target Version/s:   (was: 3.4.0)
  Resolution: Fixed

Committed to trunk. Thanks [~smarthan] for your contribution!

> Remove unused configuration dfs.namenode.block.deletion.increment. 
> ---
>
> Key: HDFS-16554
> URL: https://issues.apache.org/jira/browse/HDFS-16554
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Chengwei Wang
>Assignee: Chengwei Wang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> The configuration *_dfs.namenode.block.deletion.increment_* is no longer used
> after HDFS-16043, which performs block deletion asynchronously, so it's
> better to remove it.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-16500) Make asynchronous block deletion lock and unlock duration thresholds configurable

2022-04-20 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He resolved HDFS-16500.

   Fix Version/s: 3.4.0
Hadoop Flags: Reviewed
Target Version/s:   (was: 3.3.1, 3.3.2)
  Resolution: Fixed

Committed to trunk. Thanks [~smarthan] for your contributions.

> Make asynchronous block deletion lock and unlock duration thresholds
> configurable 
> -
>
> Key: HDFS-16500
> URL: https://issues.apache.org/jira/browse/HDFS-16500
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namanode
>Reporter: Chengwei Wang
>Assignee: Chengwei Wang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> I have backported the nice feature HDFS-16043 to our internal branch, and it
> works well in our testing cluster.
> I think it's better to make the fields *_deleteBlockLockTimeMs_* and
> *_deleteBlockUnlockIntervalTimeMs_* configurable, so that we can control the
> lock and unlock durations.
> {code:java}
> private final long deleteBlockLockTimeMs = 500;
> private final long deleteBlockUnlockIntervalTimeMs = 100;{code}
> And we should set the default values smaller to avoid blocking other requests
> for a long time when deleting large directories.
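> A sketch of how the fields could be wired to configuration (the key names and
> defaults here are illustrative, not necessarily the committed ones):
> {code:java}
> this.deleteBlockLockTimeMs = conf.getLong(
>     "dfs.namenode.block.deletion.lock.threshold.ms", 50);
> this.deleteBlockUnlockIntervalTimeMs = conf.getLong(
>     "dfs.namenode.block.deletion.unlock.interval.ms", 10);
> {code}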



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-16531) Avoid setReplication logging an edit record if old replication equals the new value

2022-04-17 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He resolved HDFS-16531.

Fix Version/s: 3.4.0
 Hadoop Flags: Reviewed
   Resolution: Fixed

Committed to trunk. Thanks [~sodonnell] for your contributions.

> Avoid setReplication logging an edit record if old replication equals the new 
> value
> ---
>
> Key: HDFS-16531
> URL: https://issues.apache.org/jira/browse/HDFS-16531
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Reporter: Stephen O'Donnell
>Assignee: Stephen O'Donnell
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> I recently came across a NN log where about 800k setRep calls were made,
> setting the replication from 3 to 3, i.e. leaving it unchanged.
> Even in a case like this, we log an edit record, an audit log, and perform 
> some quota checks etc.
> I believe it should be possible to avoid some of the work if we check for 
> oldRep == newRep and jump out of the method early.
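> The early exit the description argues for, in sketch form (names are
> illustrative):
> {code:java}
> final short oldRepl = file.getFileReplication();
> if (oldRepl == replication) {
>   // Nothing changes: skip the quota checks, log no edit record, return early.
>   return true;
> }
> {code}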



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-16534) Split datanode block pool locks to volume grain.

2022-04-17 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He resolved HDFS-16534.

Fix Version/s: 3.4.0
 Hadoop Flags: Reviewed
   Resolution: Fixed

Committed to trunk. Thanks [~Aiphag0] for your work.

> Split datanode block pool locks to volume grain.
> 
>
> Key: HDFS-16534
> URL: https://issues.apache.org/jira/browse/HDFS-16534
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Mingxiang Li
>Assignee: Mingxiang Li
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
>  This is a sub-task of HDFS-15382. HDFS-15180
> (https://issues.apache.org/jira/browse/HDFS-15180) split the lock to block
> pool grain and did some preparation. This PR is the last part: the
> volume-level lock.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-16509) Fix decommission UnsupportedOperationException: Remove unsupported

2022-04-13 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He resolved HDFS-16509.

Fix Version/s: 3.4.0
 Hadoop Flags: Reviewed
   Resolution: Fixed

Committed to trunk. Thanks [~cndaimin] for your contributions.

> Fix decommission UnsupportedOperationException: Remove unsupported
> --
>
> Key: HDFS-16509
> URL: https://issues.apache.org/jira/browse/HDFS-16509
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 3.3.1, 3.3.2
>Reporter: daimin
>Assignee: daimin
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>  Time Spent: 2h 50m
>  Remaining Estimate: 0h
>
> We encountered an "UnsupportedOperationException: Remove unsupported" error
> when some datanodes were in decommission. The cause of the exception is that
> datanode.getBlockIterator() returns an Iterator that does not support
> remove(), yet DatanodeAdminDefaultMonitor#processBlocksInternal invokes
> it.remove() when a block is not found, e.g. when the file containing the
> block has been deleted.
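> A sketch of the usual workaround: defer removals until after iteration instead
> of calling remove() on an iterator that forbids it (the surrounding names are
> illustrative):
> {code:java}
> List<BlockInfo> stale = new ArrayList<>();
> while (it.hasNext()) {
>   BlockInfo block = it.next();
>   if (blockManager.getStoredBlock(block) == null) {
>     stale.add(block); // it.remove() would throw UnsupportedOperationException
>     continue;
>   }
>   // ... normal per-block processing ...
> }
> stale.forEach(datanode::removeBlock); // hypothetical removal hook
> {code}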



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-16516) Fix wrong params in FileSystemShell

2022-04-11 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He resolved HDFS-16516.

Fix Version/s: 3.4.0
 Hadoop Flags: Reviewed
   Resolution: Fixed

Committed to trunk. Thanks [~philipse] for your contributions.

> Fix wrong params in FileSystemShell
> 
>
> Key: HDFS-16516
> URL: https://issues.apache.org/jira/browse/HDFS-16516
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: documentation
>Affects Versions: 3.3.2
>Reporter: guophilipse
>Assignee: guophilipse
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> Fix wrong param name in FileSystemShell



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-16498) Fix NPE for checkBlockReportLease

2022-03-30 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He resolved HDFS-16498.

Fix Version/s: 3.4.0
 Hadoop Flags: Reviewed
   Resolution: Fixed

Committed to trunk. Thanks @tomscut for your contribution!

> Fix NPE for checkBlockReportLease
> -
>
> Key: HDFS-16498
> URL: https://issues.apache.org/jira/browse/HDFS-16498
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: tomscut
>Assignee: tomscut
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
> Attachments: image-2022-03-09-20-35-22-028.png, screenshot-1.png
>
>  Time Spent: 5h 20m
>  Remaining Estimate: 0h
>
> During a restart of the Namenode, a Datanode that is not yet registered can
> trigger an FBR, which causes an NPE.
> !image-2022-03-09-20-35-22-028.png|width=871,height=158!



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-15987) Improve oiv tool to parse fsimage file in parallel with delimited format

2022-03-22 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He resolved HDFS-15987.

Fix Version/s: 3.4.0
 Hadoop Flags: Reviewed
   Resolution: Fixed

Committed to trunk. Thanks [~wanghongbing] for your contributions! 

> Improve oiv tool to parse fsimage file in parallel with delimited format
> 
>
> Key: HDFS-15987
> URL: https://issues.apache.org/jira/browse/HDFS-15987
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Hongbing Wang
>Assignee: Hongbing Wang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
> Attachments: Improve_oiv_tool_001.pdf
>
>  Time Spent: 6.5h
>  Remaining Estimate: 0h
>
> The purpose of this Jira is to improve the oiv tool to parse an fsimage file
> with sub-sections (see -HDFS-14617-) in parallel for the Delimited format.
> 1. Serial parsing is time-consuming
> The time to serially parse a large fsimage with delimited format (e.g. `hdfs 
> oiv -p Delimited -t  ...`) is as follows: 
> {code:java}
> 1) Loading string table: -> Not time consuming.
> 2) Loading inode references: -> Not time consuming
> 3) Loading directories in INode section: -> Slightly time consuming (3%)
> 4) Loading INode directory section:  -> A bit time consuming (11%)
> 5) Output:   -> Very time consuming (86%){code}
> Therefore, the output stage benefits most from parallelization.
> 2. How to output in parallel
> The sub-sections are grouped in order; each thread processes one group and
> writes to its own output file, and the per-thread files are merged at the
> end.
> 3. The result of a test
> {code:java}
>  input fsimage file info:
>  3.4G, 12 sub-sections, 55976500 INodes
>  -
>  Threads   TotalTime   OutputTime   MergeTime
>  1         18m37s      16m18s       –
>  4         8m7s        4m49s        41s{code}
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-16504) Add parameter for NameNode to process getBlocks request

2022-03-20 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He resolved HDFS-16504.

Resolution: Fixed

> Add parameter for NameNode to process getBlocks request
> --
>
> Key: HDFS-16504
> URL: https://issues.apache.org/jira/browse/HDFS-16504
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: balancer  mover, namanode
>Affects Versions: 3.4.0
>Reporter: Max  Xie
>Assignee: Max  Xie
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>  Time Spent: 2h
>  Remaining Estimate: 0h
>
> HDFS-13183 added a nice feature that lets the Standby NameNode process
> getBlocks requests to reduce the Active NameNode's load. The NameNode must
> set `dfs.ha.allow.stale.reads = true` to enable this feature. However, with
> that setting the Standby NameNode can process all read requests, which may
> lead to YARN job failures because the Standby NameNode is stale.
> Maybe we should add a config `dfs.namenode.get-blocks.check.operation=false`
> for the NameNode to disable the operation check when it processes getBlocks
> requests.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-16494) Removed reuse of AvailableSpaceVolumeChoosingPolicy#initLocks()

2022-03-16 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He resolved HDFS-16494.

Fix Version/s: 3.4.0
 Hadoop Flags: Reviewed
   Resolution: Fixed

Committed to trunk. Thanks [~jianghuazhu] for your contributions.

> Removed reuse of AvailableSpaceVolumeChoosingPolicy#initLocks()
> ---
>
> Key: HDFS-16494
> URL: https://issues.apache.org/jira/browse/HDFS-16494
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Affects Versions: 2.9.2, 3.4.0
>Reporter: JiangHua Zhu
>Assignee: JiangHua Zhu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> When constructing the AvailableSpaceVolumeChoosingPolicy via the default
> constructor, initLocks() is called twice, which is unnecessary.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-15382) Split one FsDatasetImpl lock to block pool grain locks.

2022-03-12 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He resolved HDFS-15382.

Fix Version/s: 3.4.0
 Hadoop Flags: Reviewed
   Resolution: Fixed

Committed to trunk. Thanks [~Aiphag0] for your work, and thanks all (too many
to mention everyone) for your suggestions.

> Split one FsDatasetImpl lock to block pool grain locks.
> ---
>
> Key: HDFS-15382
> URL: https://issues.apache.org/jira/browse/HDFS-15382
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Mingxiang Li
>Assignee: Mingxiang Li
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
> Attachments: HDFS-15382-sample.patch, image-2020-06-02-1.png, 
> image-2020-06-03-1.png
>
>  Time Spent: 4h 50m
>  Remaining Estimate: 0h
>
> In HDFS-15180 we split the lock to block pool granularity. But when one
> volume is under heavy load, it blocks other requests in the same block pool
> but on different volumes. So we split the lock into two levels to avoid this
> and to improve datanode performance.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-16429) Add DataSetLockManager to manage fine-grain locks for FsDataSetImpl

2022-01-27 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He resolved HDFS-16429.

Fix Version/s: 3.4.0
 Hadoop Flags: Reviewed
   Resolution: Fixed

Committed to trunk. Thanks [~Aiphag0] for your contributions!

> Add DataSetLockManager to manage fine-grain locks for FsDataSetImpl
> ---
>
> Key: HDFS-16429
> URL: https://issues.apache.org/jira/browse/HDFS-16429
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: hdfs
>Reporter: Mingxiang Li
>Assignee: Mingxiang Li
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> 1. Use lockManager to maintain a two-level lock for FsDataSetImpl.
> The simple lock model is as follows; parts of it are implemented like this
> (see the sketch after this list):
>  * For finalizeReplica(), append(), createRbw(): first take the BlockPoolLock
> read lock, then the BlockPoolLock-volume write lock.
>  * For getStoredBlock(), getMetaDataInputStream(): first take the
> BlockPoolLock read lock, then the BlockPoolLock-volume read lock.
>  * For deepCopyReplica(), getBlockReports(): take the BlockPoolLock read lock.
>  * For delete: hold the BlockPoolLock write lock.
> 2. Make LightWeightResizableGSet thread safe. It does not become a
> performance bottleneck when made thread safe, and this lets us reduce the
> lock grain size for ReplicaMap.
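> A sketch of the two-level acquisition order for a write-path call such as
> createRbw(), using the AutoCloseDataSetLock API quoted elsewhere in this
> thread (the VOLUME level name and storage id argument are illustrative):
> {code:java}
> try (AutoCloseDataSetLock bp = lockManager.readLock(LockLevel.BLOCK_POOl, bpid);
>      AutoCloseDataSetLock vol = lockManager.writeLock(LockLevel.VOLUME,
>          bpid, volumeStorageId)) {
>   // ... create the rbw replica under the volume write lock ...
> }
> {code}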



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-16428) Source path with storagePolicy cause wrong typeConsumed while rename

2022-01-24 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He resolved HDFS-16428.

Fix Version/s: 3.4.0
 Hadoop Flags: Reviewed
   Resolution: Fixed

Committed to trunk. Thanks [~lei w] for your contributions!

> Source path with storagePolicy cause wrong typeConsumed while rename
> 
>
> Key: HDFS-16428
> URL: https://issues.apache.org/jira/browse/HDFS-16428
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs, namenode
>Reporter: lei w
>Assignee: lei w
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
> Attachments: example.txt
>
>  Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> When computing quota in a rename operation, we use the storage policy of the
> target directory to compute the src quota usage. This causes a wrong
> typeConsumed value when the source path has its own storage policy set. I
> provided a unit test to demonstrate this situation.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-16402) Improve HeartbeatManager logic to avoid incorrect stats

2022-01-23 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He resolved HDFS-16402.

Fix Version/s: 3.4.0
 Hadoop Flags: Reviewed
   Resolution: Fixed

Committed to trunk. Thanks [~tomscut] for your reports and contributions!

> Improve HeartbeatManager logic to avoid incorrect stats
> ---
>
> Key: HDFS-16402
> URL: https://issues.apache.org/jira/browse/HDFS-16402
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: tomscut
>Assignee: tomscut
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
> Attachments: image-2021-12-29-08-25-44-303.png, 
> image-2021-12-29-08-25-54-441.png
>
>  Time Spent: 3h
>  Remaining Estimate: 0h
>
> After reconfiguring {*}dfs.datanode.data.dir{*}, we found that the stats on
> the Namenode web UI became *negative* and there were many NPEs in the
> namenode logs. This problem has been solved by HDFS-14042.
> !image-2021-12-29-08-25-54-441.png|width=681,height=293!
> !image-2021-12-29-08-25-44-303.png|width=677,height=180!
> However, if *HeartbeatManager#updateHeartbeat* and 
> *HeartbeatManager#updateLifeline* throw other exceptions, stats errors can 
> also occur. We should ensure that *stats.subtract()* and *stats.add()* are 
> transactional.
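> A sketch of that transactional guarantee (simplified signature; the real
> method takes the full heartbeat arguments):
> {code:java}
> synchronized void updateHeartbeat(DatanodeDescriptor node,
>     StorageReport[] reports) {
>   stats.subtract(node);
>   try {
>     node.updateHeartbeatState(reports);
>   } finally {
>     stats.add(node); // re-add even if the update throws, keeping totals consistent
>   }
> }
> {code}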



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-16043) Add markedDeleteBlockScrubberThread to delete blocks asynchronously

2022-01-12 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He resolved HDFS-16043.

Fix Version/s: 3.4.0
 Hadoop Flags: Reviewed
   Resolution: Fixed

Committed to trunk. Will cherry-pick to other active branches if no explicit 
conflict.

> Add markedDeleteBlockScrubberThread to delete blocks asynchronously
> ---
>
> Key: HDFS-16043
> URL: https://issues.apache.org/jira/browse/HDFS-16043
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs, namanode
>Affects Versions: 3.4.0
>Reporter: Xiangyi Zhu
>Assignee: Xiangyi Zhu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
> Attachments: 20210527-after.svg, 20210527-before.svg
>
>  Time Spent: 9h 20m
>  Remaining Estimate: 0h
>
> Deleting a large directory caused the NN to hold the lock for too long, which
> caused our NameNode to be killed by ZKFC.
>  From the flame graph, the main time-consuming calculation is QuotaCount
> during removeBlocks(toRemovedBlocks) and inode deletion, with
> removeBlocks(toRemovedBlocks) taking the larger share of the time.
> h3. Solution:
> 1. Process removeBlocks asynchronously: a thread is started in the
> BlockManager to process the deleted blocks and bound the lock time.
>  2. Optimize the QuotaCount calculation, similar to the optimization in
> HDFS-16000.
> h3. Comparison before and after optimization:
> Delete 1000w Inode and 1000w block test.
>  *before:*
> remove inode elapsed time: 7691 ms
>  remove block elapsed time :11107 ms
>  *after:*
>  remove inode elapsed time: 4149 ms
>  remove block elapsed time :0 ms



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-16418) task

2022-01-09 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He resolved HDFS-16418.

Resolution: Invalid

Closing as this is an invalid JIRA.
[~sundasibrar] Please feel free to reopen it if you can offer more information
about the issue.

> task
> 
>
> Key: HDFS-16418
> URL: https://issues.apache.org/jira/browse/HDFS-16418
> Project: Hadoop HDFS
>  Issue Type: Task
>Reporter: sundas khann
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-16404) Fix typo for CachingGetSpaceUsed

2022-01-09 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He resolved HDFS-16404.

Fix Version/s: 3.4.0
 Hadoop Flags: Reviewed
   Resolution: Fixed

Committed to trunk. Thanks [~tomscut] for your report and fix!

> Fix typo for CachingGetSpaceUsed
> 
>
> Key: HDFS-16404
> URL: https://issues.apache.org/jira/browse/HDFS-16404
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: tomscut
>Assignee: tomscut
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> Fix typo for CachingGetSpaceUsed.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-16352) return the real datanode numBlocks in #getDatanodeStorageReport

2021-12-16 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He resolved HDFS-16352.

Fix Version/s: 3.4.0
 Hadoop Flags: Reviewed
   Resolution: Fixed

Committed to trunk. Thanks [~qinyuren] for your report and contribution!

> return the real datanode numBlocks in #getDatanodeStorageReport
> ---
>
> Key: HDFS-16352
> URL: https://issues.apache.org/jira/browse/HDFS-16352
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: qinyuren
>Assignee: qinyuren
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
> Attachments: image-2021-11-23-22-04-06-131.png
>
>  Time Spent: 3h 40m
>  Remaining Estimate: 0h
>
> #getDatanodeStorageReport returns an array of DatanodeStorageReport objects,
> each containing a DatanodeInfo, but the numBlocks in DatanodeInfo is always
> zero, which is confusing
> !image-2021-11-23-22-04-06-131.png|width=683,height=338!
> Or we can return the real numBlocks in DatanodeInfo when we call 
> #getDatanodeStorageReport



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-16378) Add datanode address to BlockReportLeaseManager logs

2021-12-14 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He resolved HDFS-16378.

Fix Version/s: 3.4.0
 Hadoop Flags: Reviewed
   Resolution: Fixed

Committed to trunk. Thanks [~tomscut] for your report and contributions!

> Add datanode address to BlockReportLeaseManager logs
> 
>
> Key: HDFS-16378
> URL: https://issues.apache.org/jira/browse/HDFS-16378
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: tomscut
>Assignee: tomscut
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 3.4.0
>
> Attachments: image-2021-12-11-09-58-59-494.png
>
>  Time Spent: 2h 50m
>  Remaining Estimate: 0h
>
> We should add the datanode address to BlockReportLeaseManager logs, because
> the datanode UUID alone is not convenient for tracking.
> !image-2021-12-11-09-58-59-494.png|width=643,height=152!



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Reopened] (HDFS-14997) BPServiceActor processes commands from NameNode asynchronously

2021-09-14 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He reopened HDFS-14997:


> BPServiceActor processes commands from NameNode asynchronously
> --
>
> Key: HDFS-14997
> URL: https://issues.apache.org/jira/browse/HDFS-14997
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Reporter: Xiaoqiao He
>Assignee: Xiaoqiao He
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: HDFS-14997-branch-3.2.001.patch, HDFS-14997.001.patch, 
> HDFS-14997.002.patch, HDFS-14997.003.patch, HDFS-14997.004.patch, 
> HDFS-14997.005.patch, HDFS-14997.addendum.patch, 
> image-2019-12-26-16-15-44-814.png
>
>
> There are two core functions in the #BPServiceActor main process flow: 
> reporting (#sendHeartbeat, #blockReport, #cacheReport) and #processCommand. 
> If processCommand takes a long time, it blocks the report flow, and 
> processCommand can take a very long time (over 1000s in the worst case I 
> have seen) when the IO load of the DataNode is very high. Since some IO 
> operations run under #datasetLock, processing certain commands (such as 
> #DNA_INVALIDATE) has to wait a long time to acquire #datasetLock. In that 
> case, the #heartbeat is not sent to the NameNode in time, which triggers 
> other disasters.
> I propose to run #processCommand asynchronously so that it does not block 
> #BPServiceActor from sending heartbeats back to the NameNode under high IO 
> load; see the sketch after the notes below.
> Notes:
> 1. Lifeline could be one effective solution; however, some old branches do 
> not support this feature.
> 2. IO operations under #datasetLock are a separate issue; I think we should 
> solve that in another JIRA.
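
A minimal sketch of the proposed decoupling, with illustrative class and method names rather than the committed patch: the actor enqueues NameNode commands and returns to its heartbeat loop immediately, while a dedicated thread drains the queue, so a slow command (e.g. DNA_INVALIDATE waiting on the dataset lock) no longer delays heartbeats:

```
// Hedged sketch, assuming a simple queue-plus-worker decomposition.
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

class AsyncCommandProcessor {
  private final BlockingQueue<Runnable> commands = new LinkedBlockingQueue<>();
  private final Thread worker;

  AsyncCommandProcessor() {
    worker = new Thread(() -> {
      try {
        while (!Thread.currentThread().isInterrupted()) {
          // May block on the dataset lock; heartbeats are unaffected
          // because they run on the actor thread, not here.
          commands.take().run();
        }
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    }, "CommandProcessingThread");
    worker.setDaemon(true);
    worker.start();
  }

  // Called from the actor's service loop instead of processing inline.
  void submit(Runnable command) {
    commands.offer(command);
  }
}
```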



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-16223) AvailableSpaceRackFaultTolerantBlockPlacementPolicy should use chooseRandomWithStorageTypeTwoTrial() for better performance.

2021-09-13 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He resolved HDFS-16223.

Fix Version/s: 3.4.0
 Hadoop Flags: Reviewed
   Resolution: Fixed

Committed to trunk. Thanks [~ayushtkn] for your work!

> AvailableSpaceRackFaultTolerantBlockPlacementPolicy should use 
> chooseRandomWithStorageTypeTwoTrial() for better performance.
> 
>
> Key: HDFS-16223
> URL: https://issues.apache.org/jira/browse/HDFS-16223
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Ayush Saxena
>Assignee: Ayush Saxena
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> Use chooseRandomWithStorageTypeTwoTrial as AvailableSpaceBlockPlacementPolicy 
> does.
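
A conceptual sketch of the two-trial idea. This is not the DFSNetworkTopology implementation; TwoTrialChooser and its members are illustrative. The first trial does a plain random pick and validates the storage type, which is cheap and usually succeeds on a balanced cluster; only on a miss does it fall back to the restricted, more expensive selection:

```
// Hedged sketch of the "two trial" random selection optimization.
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;
import java.util.function.Predicate;
import java.util.stream.Collectors;

final class TwoTrialChooser<N> {
  /** Returns a random node satisfying the predicate, or null if none. */
  N choose(List<N> nodes, Predicate<N> hasStorageType) {
    if (nodes.isEmpty()) {
      return null;
    }
    // Trial 1: uniform random pick over the whole scope, then validate.
    N candidate =
        nodes.get(ThreadLocalRandom.current().nextInt(nodes.size()));
    if (hasStorageType.test(candidate)) {
      return candidate;
    }
    // Trial 2: fall back to the costlier selection restricted to nodes
    // that actually have the requested storage type.
    List<N> qualified =
        nodes.stream().filter(hasStorageType).collect(Collectors.toList());
    return qualified.isEmpty()
        ? null
        : qualified.get(ThreadLocalRandom.current().nextInt(qualified.size()));
  }
}
```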



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-15160) ReplicaMap, Disk Balancer, Directory Scanner and various FsDatasetImpl methods should use datanode readlock

2021-09-11 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He resolved HDFS-15160.

Fix Version/s: 3.2.3
   Resolution: Fixed

> ReplicaMap, Disk Balancer, Directory Scanner and various FsDatasetImpl 
> methods should use datanode readlock
> ---
>
> Key: HDFS-15160
> URL: https://issues.apache.org/jira/browse/HDFS-15160
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Affects Versions: 3.3.0
>Reporter: Stephen O'Donnell
>Assignee: Stephen O'Donnell
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0, 3.2.3, 3.3.1
>
> Attachments: HDFS-15160-branch-3.3-001.patch, HDFS-15160.001.patch, 
> HDFS-15160.002.patch, HDFS-15160.003.patch, HDFS-15160.004.patch, 
> HDFS-15160.005.patch, HDFS-15160.006.patch, HDFS-15160.007.patch, 
> HDFS-15160.008.patch, HDFS-15160.branch-3-3.001.patch, 
> image-2020-04-10-17-18-08-128.png, image-2020-04-10-17-18-55-938.png
>
>  Time Spent: 6h
>  Remaining Estimate: 0h
>
> Now we have HDFS-15150, we can start to move some DN operations to use the 
> read lock rather than the write lock to improve concurrency. The first step 
> is to make the changes to ReplicaMap, as many other methods make calls to it.
> This Jira switches read operations against the volume map to use the read 
> lock rather than the write lock.
> Additionally, some methods make a call to replicaMap.replicas() (e.g. 
> getBlockReports, getFinalizedBlocks, deepCopyReplica) and only use the result 
> in a read-only fashion, so they can also be switched to using a read lock.
> Next are the directory scanner and disk balancer, which only require a read 
> lock.
> Finally (for this Jira) come various "low hanging fruit" items in BlockSender 
> and FsDatasetImpl where it is fairly obvious they only need a read lock.
> For now, I have avoided changing anything which looks too risky, as I think 
> it's better to do any larger refactoring or risky changes each in their own 
> Jira.
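
A small sketch of the locking pattern this change moves toward, assuming nothing about the actual FsDatasetImpl internals (ReplicaMapSketch and its value type are illustrative): read-only queries take the shared read lock so they can proceed concurrently, while mutations keep the exclusive write lock:

```
// Hedged sketch of read/write lock separation for a read-mostly map.
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

class ReplicaMapSketch {
  private final ReadWriteLock lock = new ReentrantReadWriteLock();
  private final Map<Long, String> replicas = new HashMap<>();  // blockId -> state

  // Read path (e.g. getBlockReports, deepCopyReplica): shared lock,
  // many readers may run at once.
  String get(long blockId) {
    lock.readLock().lock();
    try {
      return replicas.get(blockId);
    } finally {
      lock.readLock().unlock();
    }
  }

  // Write path (e.g. adding or removing a replica): exclusive lock.
  void put(long blockId, String state) {
    lock.writeLock().lock();
    try {
      replicas.put(blockId, state);
    } finally {
      lock.writeLock().unlock();
    }
  }
}
```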



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org


