[jira] [Commented] (HDFS-17343) Revert HDFS-16016. BPServiceActor to provide new thread to handle IBR

2024-01-19 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17808541#comment-17808541
 ] 

ASF GitHub Bot commented on HDFS-17343:
---

ayushtkn commented on PR #6457:
URL: https://github.com/apache/hadoop/pull/6457#issuecomment-1899971792

   A revert doesn't need approval; just revert, remove 3.4.0 from the fix 
version in the original ticket, and mention there that it has been reverted as 
part of this ticket, and for what reason




> Revert HDFS-16016. BPServiceActor to provide new thread to handle IBR
> -
>
> Key: HDFS-17343
> URL: https://issues.apache.org/jira/browse/HDFS-17343
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Affects Versions: 3.4.0
>Reporter: Shilun Fan
>Assignee: Shilun Fan
>Priority: Major
>  Labels: pull-request-available
>
> When preparing for the hadoop-3.4.0 release, we found that HDFS-16016 may 
> cause incremental block reports (IBRs) and full block reports (FBRs) to be 
> sent out of order on the DataNode. After discussion, we decided to revert 
> HDFS-16016.






[jira] [Commented] (HDFS-17311) RBF: ConnectionManager creatorQueue should offer a pool that is not already in creatorQueue.

2024-01-19 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17808549#comment-17808549
 ] 

ASF GitHub Bot commented on HDFS-17311:
---

hadoop-yetus commented on PR #6392:
URL: https://github.com/apache/hadoop/pull/6392#issuecomment-182973

   :confetti_ball: **+1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |:----:|----------:|:--------|:--------:|:-------:|
   | +0 :ok: |  reexec  |   0m 48s |  |  Docker mode activated.  |
    _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  1s |  |  No case conflicting files 
found.  |
   | +0 :ok: |  codespell  |   0m  0s |  |  codespell was not available.  |
   | +0 :ok: |  detsecrets  |   0m  0s |  |  detect-secrets was not available.  
|
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain 
any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to 
include 1 new or modified test files.  |
    _ trunk Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |  47m  0s |  |  trunk passed  |
   | +1 :green_heart: |  compile  |   0m 40s |  |  trunk passed with JDK 
Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04  |
   | +1 :green_heart: |  compile  |   0m 35s |  |  trunk passed with JDK 
Private Build-1.8.0_392-8u392-ga-1~20.04-b08  |
   | +1 :green_heart: |  checkstyle  |   0m 29s |  |  trunk passed  |
   | +1 :green_heart: |  mvnsite  |   0m 41s |  |  trunk passed  |
   | +1 :green_heart: |  javadoc  |   0m 42s |  |  trunk passed with JDK 
Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04  |
   | +1 :green_heart: |  javadoc  |   0m 30s |  |  trunk passed with JDK 
Private Build-1.8.0_392-8u392-ga-1~20.04-b08  |
   | +1 :green_heart: |  spotbugs  |   1m 21s |  |  trunk passed  |
   | +1 :green_heart: |  shadedclient  |  38m  6s |  |  branch has no errors 
when building and testing our client artifacts.  |
    _ Patch Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |   0m 30s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   0m 33s |  |  the patch passed with JDK 
Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04  |
   | +1 :green_heart: |  javac  |   0m 33s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   0m 28s |  |  the patch passed with JDK 
Private Build-1.8.0_392-8u392-ga-1~20.04-b08  |
   | +1 :green_heart: |  javac  |   0m 28s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks 
issues.  |
   | +1 :green_heart: |  checkstyle  |   0m 18s |  |  the patch passed  |
   | +1 :green_heart: |  mvnsite  |   0m 31s |  |  the patch passed  |
   | +1 :green_heart: |  javadoc  |   0m 28s |  |  the patch passed with JDK 
Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04  |
   | +1 :green_heart: |  javadoc  |   0m 23s |  |  the patch passed with JDK 
Private Build-1.8.0_392-8u392-ga-1~20.04-b08  |
   | +1 :green_heart: |  spotbugs  |   1m 20s |  |  the patch passed  |
   | +1 :green_heart: |  shadedclient  |  38m  0s |  |  patch has no errors 
when building and testing our client artifacts.  |
    _ Other Tests _ |
   | +1 :green_heart: |  unit  |  23m  2s |  |  hadoop-hdfs-rbf in the patch 
passed.  |
   | +1 :green_heart: |  asflicense  |   0m 36s |  |  The patch does not 
generate ASF License warnings.  |
   |  |   | 161m 15s |  |  |
   
   
   | Subsystem | Report/Notes |
   |----------:|:-------------|
   | Docker | ClientAPI=1.43 ServerAPI=1.43 base: 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6392/4/artifact/out/Dockerfile
 |
   | GITHUB PR | https://github.com/apache/hadoop/pull/6392 |
   | Optional Tests | dupname asflicense compile javac javadoc mvninstall 
mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
   | uname | Linux b5fd50417e95 5.15.0-88-generic #98-Ubuntu SMP Mon Oct 2 
15:18:56 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/bin/hadoop.sh |
   | git revision | trunk / 1867ce2130ea54cc7de7aaa03527b8ce601638a9 |
   | Default Java | Private Build-1.8.0_392-8u392-ga-1~20.04-b08 |
   | Multi-JDK versions | 
/usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04 
/usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_392-8u392-ga-1~20.04-b08 |
   |  Test Results | 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6392/4/testReport/ |
   | Max. process+thread count | 2371 (vs. ulimit of 5500) |
   | modules | C: hadoop-hdfs-project/hadoop-hdfs-rbf U: 
hadoop-hdfs-project/hadoop-hdfs-rbf |
   | Console output | 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6392/4/console |
   | versions | git=2.25.1 maven=3.6.3 spotbugs=4.2.2 |
   | Powered by | Apache Yetus 0.14.0 https://yetus.apache.org |
   
   
   This message was automatically generated.
   
   




> RBF: ConnectionManager creatorQueue should offer a pool that is not already 
> in creatorQueue.

[jira] [Commented] (HDFS-17342) Fix DataNode may invalidate normal block causing missing block

2024-01-19 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17808553#comment-17808553
 ] 

ASF GitHub Bot commented on HDFS-17342:
---

haiyang1987 commented on PR #6464:
URL: https://github.com/apache/hadoop/pull/6464#issuecomment-196775

   Hi @zhangshuyan0, thanks for your review.
   Fixed the UT failure; please help review it again, thanks~




> Fix DataNode may invalidate normal block causing missing block
> ---
>
> Key: HDFS-17342
> URL: https://issues.apache.org/jira/browse/HDFS-17342
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Reporter: Haiyang Hu
>Assignee: Haiyang Hu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.5.0
>
>
> When users read a file that is being appended, occasional exceptions may occur, 
> such as org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block: xxx.
> This can happen if one thread is reading the block while the writer thread is 
> finalizing it simultaneously.
> *Root cause:*
> # The reader thread obtains an RBW replica from the volume map, such as 
> blk_xxx_xxx[RBW], whose data file is in /XXX/rbw/blk_xxx.
> # Simultaneously, the writer thread finalizes this block, moving it from the 
> RBW directory to the FINALIZED directory: the data file moves from 
> /XXX/rbw/blk_xxx to /XXX/finalized/blk_xxx.
> # The reader thread attempts to open the data input stream but encounters a 
> FileNotFoundException, because the data file /XXX/rbw/blk_xxx or the meta file 
> /XXX/rbw/blk_xxx_xxx no longer exists at this moment.
> # The reader thread treats this block as corrupt and removes the replica from 
> the volume map, and the DataNode reports the deleted block to the NameNode.
> # The NameNode removes this replica for the block.
> # If the current file replication is 1, the file has a missing block until 
> this DataNode runs the DirectoryScanner again.
> As described above, the FileNotFoundException the reader thread encounters is 
> expected, because the file has been moved.
> So we need to add a double check to the invalidateMissingBlock logic to verify 
> whether the data file or meta file exists, to avoid similar cases.
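
A minimal sketch of that double check, paraphrased from the patch hunk quoted later in this thread (the surrounding locking and logging in FsDatasetImpl#invalidateMissingBlock are elided):

{code:java}
// Sketch: only invalidate when checks are disabled, or when the data file or
// meta file is really missing on disk; otherwise the replica was likely just
// moved (e.g. finalized) and must not be reported to the NameNode as deleted.
ReplicaInfo replica = volumeMap.get(bpid, block);
if (replica != null && (!checkFiles ||
    !replica.blockDataExists() || !replica.metadataExists())) {
  volumeMap.remove(bpid, block);
  invalidate(bpid, replica);
}
{code}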






[jira] [Created] (HDFS-17344) Last packet will be split into two parts when writing a block

2024-01-19 Thread farmmamba (Jira)
farmmamba created HDFS-17344:


 Summary: Last packet will be split into two parts when writing a 
block
 Key: HDFS-17344
 URL: https://issues.apache.org/jira/browse/HDFS-17344
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: hdfs-client
Affects Versions: 3.5.0
Reporter: farmmamba
Assignee: farmmamba


As mentioned in 
[https://github.com/apache/hadoop/pull/6368#issuecomment-1899635293]

This Jira tries to solve that problem.






[jira] [Commented] (HDFS-17343) Revert HDFS-16016. BPServiceActor to provide new thread to handle IBR

2024-01-19 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17808640#comment-17808640
 ] 

ASF GitHub Bot commented on HDFS-17343:
---

tasanuma commented on PR #6457:
URL: https://github.com/apache/hadoop/pull/6457#issuecomment-1900353462

   Creating another ticket is my request, as I commented 
[here](https://github.com/apache/hadoop/pull/6457#issuecomment-1895185176). We 
also want to revert HDFS-16016 from branch-3.3, but we cannot remove 3.3.6 from 
the fix version. So, if it has been released once or more, I would request 
another JIRA for reverting it.




> Revert HDFS-16016. BPServiceActor to provide new thread to handle IBR
> -
>
> Key: HDFS-17343
> URL: https://issues.apache.org/jira/browse/HDFS-17343
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Affects Versions: 3.4.0
>Reporter: Shilun Fan
>Assignee: Shilun Fan
>Priority: Major
>  Labels: pull-request-available
>
> When preparing for the hadoop-3.4.0 release, we found that HDFS-16016 may 
> cause incremental block reports (IBRs) and full block reports (FBRs) to be 
> sent out of order on the DataNode. After discussion, we decided to revert 
> HDFS-16016.






[jira] [Commented] (HDFS-17293) First packet data + checksum size will be set to 516 bytes when writing to a new block.

2024-01-19 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17808642#comment-17808642
 ] 

ASF GitHub Bot commented on HDFS-17293:
---

hfutatzhanghb commented on PR #6368:
URL: https://github.com/apache/hadoop/pull/6368#issuecomment-1900355060

   @zhangshuyan0 Sir, I have modified the code according to your review 
comments. Please take a look when you have free time. Thanks a lot~




> First packet data + checksum size will be set to 516 bytes when writing to a 
> new block.
> ---
>
> Key: HDFS-17293
> URL: https://issues.apache.org/jira/browse/HDFS-17293
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Affects Versions: 3.3.6
>Reporter: farmmamba
>Assignee: farmmamba
>Priority: Major
>  Labels: pull-request-available
>
> First packet size will be set to 516 bytes when writing to a new block.
> In the method computePacketChunkSize, the parameters psize and csize would be 
> (0, 512) when writing to a new block. It would be better to use 
> writePacketSize.
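
For context, the 516 bytes fall out of the packet-sizing arithmetic: with a body size of 0, the packet is still sized to hold at least one chunk, and each 512-byte chunk travels with its 4-byte checksum. A simplified sketch of that computation (names follow computePacketChunkSize in DFSOutputStream; packet header overhead is elided):

{code:java}
// Simplified sketch, assuming a 4-byte checksum (e.g. CRC32C) per 512-byte chunk.
int csize = 512;                       // bytes of data per chunk
int checksumSize = 4;                  // checksum bytes per chunk
int chunkSize = csize + checksumSize;  // 516 bytes per chunk on the wire
int psize = 0;                         // requested body size for a new block
int chunksPerPacket = Math.max(psize / chunkSize, 1);  // at least 1 chunk
int packetSize = chunkSize * chunksPerPacket;          // -> 516 bytes
{code}

Using writePacketSize for psize instead would let the first packet carry many chunks rather than a single one.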






[jira] [Commented] (HDFS-17342) Fix DataNode may invalidate normal block causing missing block

2024-01-19 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17808651#comment-17808651
 ] 

ASF GitHub Bot commented on HDFS-17342:
---

smarthanwang commented on code in PR #6464:
URL: https://github.com/apache/hadoop/pull/6464#discussion_r1458970508


##
hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/impl/TestFsDatasetImpl.java:
##
@@ -2011,4 +2011,83 @@ public void tesInvalidateMissingBlock() throws Exception {
       cluster.shutdown();
     }
   }
+
+  @Test
+  public void testCheckFilesWhenInvalidateMissingBlock() throws Exception {
+    long blockSize = 1024;
+    int heartbeatInterval = 1;
+    HdfsConfiguration c = new HdfsConfiguration();
+    c.setInt(DFSConfigKeys.DFS_HEARTBEAT_INTERVAL_KEY, heartbeatInterval);
+    c.setLong(DFS_BLOCK_SIZE_KEY, blockSize);
+    MiniDFSCluster cluster = new MiniDFSCluster.Builder(c).
+        numDataNodes(1).build();
+    DataNodeFaultInjector oldDnInjector = DataNodeFaultInjector.get();
+    try {
+      cluster.waitActive();
+      BlockReaderTestUtil util = new BlockReaderTestUtil(cluster, new
+          HdfsConfiguration(conf));
+      Path path = new Path("/testFile");
+      util.writeFile(path, 1);
+      String bpid = cluster.getNameNode().getNamesystem().getBlockPoolId();
+      DataNode dn = cluster.getDataNodes().get(0);
+      FsDatasetImpl dnFSDataset = (FsDatasetImpl) dn.getFSDataset();
+      List<ReplicaInfo> replicaInfos = dnFSDataset.getFinalizedBlocks(bpid);
+      assertEquals(1, replicaInfos.size());
+      DFSTestUtil.readFile(cluster.getFileSystem(), path);
+      LocatedBlock blk = util.getFileBlocks(path, 512).get(0);
+      ExtendedBlock block = blk.getBlock();
+
+      // Append a new block with an incremented generation stamp.
+      long newGS = block.getGenerationStamp() + 1;
+      dnFSDataset.append(block, newGS, 1024);
+      block.setGenerationStamp(newGS);
+
+      DataNodeFaultInjector injector = new DataNodeFaultInjector() {
+        @Override
+        public void delayGetMetaDataInputStream() {
+          try {
+            Thread.sleep(8000);
+          } catch (InterruptedException e) {
+            // Ignore exception.
+          }
+        }
+      };
+      // Delay to getMetaDataInputStream.
+      DataNodeFaultInjector.set(injector);
+
+      ExecutorService executorService = Executors.newFixedThreadPool(2);
+      try {
+        Future<?> blockReaderFuture = executorService.submit(() -> {
+          try {
+            // Submit tasks for reading block.
+            BlockReaderTestUtil.getBlockReader(cluster.getFileSystem(), blk,
+                0, 512);

Review Comment:
Should we check that the FNE (FileNotFoundException) is thrown as expected here?



##
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/impl/FsDatasetImpl.java:
##
@@ -2416,11 +2419,21 @@ public void invalidateMissingBlock(String bpid, Block block) {
     // So remove it from the volume map and notify the namenode is ok.
     try (AutoCloseableLock lock = lockManager.writeLock(LockLevel.BLOCK_POOl,
         bpid)) {
-      ReplicaInfo replica = volumeMap.remove(bpid, block);
-      invalidate(bpid, replica);
+      // Check if this block is on the volume map.
+      ReplicaInfo replica = volumeMap.get(bpid, block);
+      // Double-check block or meta file existence when checkFiles is true.
+      if (replica != null && (!checkFiles ||
+          (!replica.blockDataExists() || !replica.metadataExists()))) {
+        volumeMap.remove(bpid, block);
+        invalidate(bpid, replica);

Review Comment:
   If `replica == null`, `invalidate(bpid, replica);` would not execute



[jira] [Commented] (HDFS-10224) Implement asynchronous rename for DistributedFileSystem

2024-01-19 Thread farmmamba (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-10224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17808697#comment-17808697
 ] 

farmmamba commented on HDFS-10224:
--

[~szetszwo] [~xiaobingo] Hi, sir. Could you please tell me why I cannot find 
the class AsyncDistributedFileSystem in the current trunk branch? Thanks a lot.

 

!image-2024-01-19-23-06-32-901.png!

> Implement asynchronous rename for DistributedFileSystem
> ---
>
> Key: HDFS-10224
> URL: https://issues.apache.org/jira/browse/HDFS-10224
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: fs, hdfs-client
>Reporter: Xiaobing Zhou
>Assignee: Xiaobing Zhou
>Priority: Major
> Fix For: 2.8.0, 3.0.0-alpha1
>
> Attachments: HDFS-10224-HDFS-9924.000.patch, 
> HDFS-10224-HDFS-9924.001.patch, HDFS-10224-HDFS-9924.002.patch, 
> HDFS-10224-HDFS-9924.003.patch, HDFS-10224-HDFS-9924.004.patch, 
> HDFS-10224-HDFS-9924.005.patch, HDFS-10224-HDFS-9924.006.patch, 
> HDFS-10224-HDFS-9924.007.patch, HDFS-10224-HDFS-9924.008.patch, 
> HDFS-10224-HDFS-9924.009.patch, HDFS-10224-and-HADOOP-12909.000.patch
>
>
> This is proposed to implement an asynchronous rename.






[jira] [Updated] (HDFS-10224) Implement asynchronous rename for DistributedFileSystem

2024-01-19 Thread farmmamba (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-10224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

farmmamba updated HDFS-10224:
-
Attachment: image-2024-01-19-23-06-32-901.png

> Implement asynchronous rename for DistributedFileSystem
> ---
>
> Key: HDFS-10224
> URL: https://issues.apache.org/jira/browse/HDFS-10224
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: fs, hdfs-client
>Reporter: Xiaobing Zhou
>Assignee: Xiaobing Zhou
>Priority: Major
> Fix For: 2.8.0, 3.0.0-alpha1
>
> Attachments: HDFS-10224-HDFS-9924.000.patch, 
> HDFS-10224-HDFS-9924.001.patch, HDFS-10224-HDFS-9924.002.patch, 
> HDFS-10224-HDFS-9924.003.patch, HDFS-10224-HDFS-9924.004.patch, 
> HDFS-10224-HDFS-9924.005.patch, HDFS-10224-HDFS-9924.006.patch, 
> HDFS-10224-HDFS-9924.007.patch, HDFS-10224-HDFS-9924.008.patch, 
> HDFS-10224-HDFS-9924.009.patch, HDFS-10224-and-HADOOP-12909.000.patch, 
> image-2024-01-19-23-06-32-901.png
>
>
> This is proposed to implement an asynchronous rename.






[jira] [Commented] (HDFS-10224) Implement asynchronous rename for DistributedFileSystem

2024-01-19 Thread Tsz-wo Sze (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-10224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17808737#comment-17808737
 ] 

Tsz-wo Sze commented on HDFS-10224:
---

[~zhanghaobo], it is unfortunate that this and the AsyncDistributedFileSystem 
class were removed by HDFS-10538.


> Implement asynchronous rename for DistributedFileSystem
> ---
>
> Key: HDFS-10224
> URL: https://issues.apache.org/jira/browse/HDFS-10224
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: fs, hdfs-client
>Reporter: Xiaobing Zhou
>Assignee: Xiaobing Zhou
>Priority: Major
> Fix For: 2.8.0, 3.0.0-alpha1
>
> Attachments: HDFS-10224-HDFS-9924.000.patch, 
> HDFS-10224-HDFS-9924.001.patch, HDFS-10224-HDFS-9924.002.patch, 
> HDFS-10224-HDFS-9924.003.patch, HDFS-10224-HDFS-9924.004.patch, 
> HDFS-10224-HDFS-9924.005.patch, HDFS-10224-HDFS-9924.006.patch, 
> HDFS-10224-HDFS-9924.007.patch, HDFS-10224-HDFS-9924.008.patch, 
> HDFS-10224-HDFS-9924.009.patch, HDFS-10224-and-HADOOP-12909.000.patch, 
> image-2024-01-19-23-06-32-901.png
>
>
> This is proposed to implement an asynchronous rename.






[jira] [Commented] (HDFS-17293) First packet data + checksum size will be set to 516 bytes when writing to a new block.

2024-01-19 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17808749#comment-17808749
 ] 

ASF GitHub Bot commented on HDFS-17293:
---

hadoop-yetus commented on PR #6368:
URL: https://github.com/apache/hadoop/pull/6368#issuecomment-1900802717

   :broken_heart: **-1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |:----:|----------:|:--------|:--------:|:-------:|
   | +0 :ok: |  reexec  |   0m 21s |  |  Docker mode activated.  |
    _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files 
found.  |
   | +0 :ok: |  codespell  |   0m  0s |  |  codespell was not available.  |
   | +0 :ok: |  detsecrets  |   0m  0s |  |  detect-secrets was not available.  
|
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain 
any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to 
include 1 new or modified test files.  |
    _ trunk Compile Tests _ |
   | +0 :ok: |  mvndep  |  13m 38s |  |  Maven dependency ordering for branch  |
   | +1 :green_heart: |  mvninstall  |  19m 14s |  |  trunk passed  |
   | +1 :green_heart: |  compile  |   2m 52s |  |  trunk passed with JDK 
Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04  |
   | +1 :green_heart: |  compile  |   2m 49s |  |  trunk passed with JDK 
Private Build-1.8.0_392-8u392-ga-1~20.04-b08  |
   | +1 :green_heart: |  checkstyle  |   0m 41s |  |  trunk passed  |
   | +1 :green_heart: |  mvnsite  |   1m 15s |  |  trunk passed  |
   | +1 :green_heart: |  javadoc  |   1m  2s |  |  trunk passed with JDK 
Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04  |
   | +1 :green_heart: |  javadoc  |   1m 29s |  |  trunk passed with JDK 
Private Build-1.8.0_392-8u392-ga-1~20.04-b08  |
   | +1 :green_heart: |  spotbugs  |   3m  5s |  |  trunk passed  |
   | +1 :green_heart: |  shadedclient  |  20m 19s |  |  branch has no errors 
when building and testing our client artifacts.  |
    _ Patch Compile Tests _ |
   | +0 :ok: |  mvndep  |   0m 19s |  |  Maven dependency ordering for patch  |
   | +1 :green_heart: |  mvninstall  |   1m  2s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   2m 48s |  |  the patch passed with JDK 
Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04  |
   | +1 :green_heart: |  javac  |   2m 48s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   2m 43s |  |  the patch passed with JDK 
Private Build-1.8.0_392-8u392-ga-1~20.04-b08  |
   | +1 :green_heart: |  javac  |   2m 43s |  |  the patch passed  |
   | -1 :x: |  blanks  |   0m  0s | 
[/blanks-eol.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6368/11/artifact/out/blanks-eol.txt)
 |  The patch has 1 line(s) that end in blanks. Use git apply --whitespace=fix 
<>. Refer https://git-scm.com/docs/git-apply  |
   | -0 :warning: |  checkstyle  |   0m 35s | 
[/results-checkstyle-hadoop-hdfs-project.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6368/11/artifact/out/results-checkstyle-hadoop-hdfs-project.txt)
 |  hadoop-hdfs-project: The patch generated 2 new + 28 unchanged - 0 fixed = 
30 total (was 28)  |
   | +1 :green_heart: |  mvnsite  |   1m  5s |  |  the patch passed  |
   | +1 :green_heart: |  javadoc  |   0m 52s |  |  the patch passed with JDK 
Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04  |
   | +1 :green_heart: |  javadoc  |   1m 22s |  |  the patch passed with JDK 
Private Build-1.8.0_392-8u392-ga-1~20.04-b08  |
   | +1 :green_heart: |  spotbugs  |   3m  8s |  |  the patch passed  |
   | +1 :green_heart: |  shadedclient  |  20m 24s |  |  patch has no errors 
when building and testing our client artifacts.  |
    _ Other Tests _ |
   | +1 :green_heart: |  unit  |   1m 49s |  |  hadoop-hdfs-client in the patch 
passed.  |
   | -1 :x: |  unit  | 196m 56s | 
[/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6368/11/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt)
 |  hadoop-hdfs in the patch passed.  |
   | +1 :green_heart: |  asflicense  |   0m 24s |  |  The patch does not 
generate ASF License warnings.  |
   |  |   | 301m  3s |  |  |
   
   
   | Reason | Tests |
   |-------:|:------|
   | Failed junit tests | hadoop.hdfs.TestDFSShell |
   |   | hadoop.hdfs.TestReconstructStripedFile |
   
   
   | Subsystem | Report/Notes |
   |----------:|:-------------|
   | Docker | ClientAPI=1.44 ServerAPI=1.44 base: 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6368/11/artifact/out/Dockerfile
 |
   | GITHUB PR | https://github.com/apache/hadoop/pull/6368 |
   | Optional Tests | dupname asflicense compile javac javadoc mvninstall 
mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
   | uname | Linux 11babc894195 5.15.0-88-generic #98-Ubuntu SMP Mon Oct 2 
15:18:56 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux |
  

[jira] [Updated] (HDFS-17332) DFSInputStream: avoid logging stacktrace until when we really need to fail a read request with a MissingBlockException

2024-01-19 Thread Chris Trezzo (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Trezzo updated HDFS-17332:

Fix Version/s: 3.5.0

> DFSInputStream: avoid logging stacktrace until when we really need to fail a 
> read request with a MissingBlockException
> --
>
> Key: HDFS-17332
> URL: https://issues.apache.org/jira/browse/HDFS-17332
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs
>Reporter: Xing Lin
>Assignee: Xing Lin
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 3.5.0
>
>
> In DFSInputStream#actualGetFromOneDataNode(), the exception stack trace is 
> sent to dfsClient.LOG whenever we fail on a DN. However, in most cases the 
> read request will be served successfully by reading from the next available 
> DN. The presence of the exception stack trace in the log has caused multiple 
> Hadoop users at LinkedIn to take this WARN message for the root cause/fatal 
> error of their jobs. We would like to improve the log message and avoid 
> sending the stack trace to dfsClient.LOG when a read succeeds. The stack 
> trace from reading each DN is sent to the log only when we really need to 
> fail a read request (when chooseDataNode()/refetchLocations() throws a 
> BlockMissingException). 
>  
> Example stack trace
> {code:java}
> [12]:23/11/30 23:01:33 WARN hdfs.DFSClient: Connection failure: 
> Failed to connect to 10.150.91.13/10.150.91.13:71 for file 
> //part--95b9909c-zzz-c000.avro for block 
> BP-364971551-DatanodeIP-1448516588954:blk__129864739321:java.net.SocketTimeoutException:
>  6 millis timeout while waiting for channel to be ready for read. ch : 
> java.nio.channels.SocketChannel[connected local=/ip:40492 
> remote=datanodeIP:71] [12]:java.net.SocketTimeoutException: 6 
> millis timeout while waiting for channel to be ready for read. ch : 
> java.nio.channels.SocketChannel[connected local=/localIp:40492 
> remote=datanodeIP:71] [12]: at 
> org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164) 
> [12]: at 
> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161) 
> [12]: at 
> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131) 
> [12]: at 
> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:118) 
> [12]: at java.io.FilterInputStream.read(FilterInputStream.java:83) 
> [12]: at 
> org.apache.hadoop.hdfs.protocolPB.PBHelperClient.vintPrefixed(PBHelperClient.java:458)
>  [12]: at 
> org.apache.hadoop.hdfs.client.impl.BlockReaderRemote2.newBlockReader(BlockReaderRemote2.java:412)
>  [12]: at 
> org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.getRemoteBlockReader(BlockReaderFactory.java:864)
>  [12]: at 
> org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:753)
>  [12]: at 
> org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.build(BlockReaderFactory.java:387)
>  [12]: at 
> org.apache.hadoop.hdfs.DFSInputStream.getBlockReader(DFSInputStream.java:736) 
> [12]: at 
> org.apache.hadoop.hdfs.DFSInputStream.actualGetFromOneDataNode(DFSInputStream.java:1268)
>  [12]: at 
> org.apache.hadoop.hdfs.DFSInputStream.fetchBlockByteRange(DFSInputStream.java:1216)
>  [12]: at 
> org.apache.hadoop.hdfs.DFSInputStream.pread(DFSInputStream.java:1608) 
> [12]: at 
> org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:1568) 
> [12]: at 
> org.apache.hadoop.fs.FSDataInputStream.read(FSDataInputStream.java:93) 
> [12]: at 
> hdfs_metrics_shade.org.apache.hadoop.fs.InstrumentedFSDataInputStream$InstrumentedFilterInputStream.lambda$read$0(InstrumentedFSDataInputStream.java:108)
>  [12]: at 
> com.linkedin.hadoop.metrics.fs.PerformanceTrackingFSDataInputStream.process(PerformanceTrackingFSDataInputStream.java:39)
>  [12]: at 
> hdfs_metrics_shade.org.apache.hadoop.fs.InstrumentedFSDataInputStream$InstrumentedFilterInputStream.read(InstrumentedFSDataInputStream.java:108)
>  [12]: at 
> org.apache.hadoop.fs.FSDataInputStream.read(FSDataInputStream.java:93) 
> [12]: at 
> org.apache.hadoop.fs.RetryingInputStream.lambda$read$2(RetryingInputStream.java:153)
>  [12]: at 
> org.apache.hadoop.fs.NoOpRetryPolicy.run(NoOpRetryPolicy.java:36) 
> [12]: at 
> org.apache.hadoop.fs.RetryingInputStream.read(RetryingInputStream.java:149) 
> [12]: at 
> org.apache.hadoop.fs.FSDataInputStream.read(FSDataInputStream.java:93){code}
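
A minimal sketch of the intended logging pattern, assuming per-DN failures are collected and only surfaced with full stack traces once the read actually fails (the class and method names below are illustrative, not the actual DFSInputStream code):

{code:java}
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

class DeferredReadLogging {
  private static final Logger LOG = LoggerFactory.getLogger("DFSClient");

  byte[] read(List<String> datanodes) throws IOException {
    List<IOException> failures = new ArrayList<>();
    for (String dn : datanodes) {
      try {
        return readFrom(dn);  // hypothetical per-DN read
      } catch (IOException e) {
        // Likely recoverable: the next DN may serve the read.
        // Log one WARN line, without the stack trace.
        LOG.warn("Failed to read from {}: {}", dn, e.getMessage());
        failures.add(e);
      }
    }
    // Every DN failed: now the stack traces matter, so log them in full
    // before failing the read (cf. BlockMissingException in DFSInputStream).
    for (IOException e : failures) {
      LOG.warn("Read failed; earlier exception:", e);
    }
    throw new IOException("Could not obtain block");
  }

  private byte[] readFrom(String dn) throws IOException {
    throw new IOException("stub");  // placeholder for the real block read
  }
}
{code}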






[jira] [Resolved] (HDFS-17332) DFSInputStream: avoid logging stacktrace until when we really need to fail a read request with a MissingBlockException

2024-01-19 Thread Chris Trezzo (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Trezzo resolved HDFS-17332.
-
Resolution: Fixed

> DFSInputStream: avoid logging stacktrace until when we really need to fail a 
> read request with a MissingBlockException
> --
>
> Key: HDFS-17332
> URL: https://issues.apache.org/jira/browse/HDFS-17332
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs
>Reporter: Xing Lin
>Assignee: Xing Lin
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 3.5.0
>
>
> In DFSInputStream#actualGetFromOneDataNode(), the exception stack trace is 
> sent to dfsClient.LOG whenever we fail on a DN. However, in most cases the 
> read request will be served successfully by reading from the next available 
> DN. The presence of the exception stack trace in the log has caused multiple 
> Hadoop users at LinkedIn to take this WARN message for the root cause/fatal 
> error of their jobs. We would like to improve the log message and avoid 
> sending the stack trace to dfsClient.LOG when a read succeeds. The stack 
> trace from reading each DN is sent to the log only when we really need to 
> fail a read request (when chooseDataNode()/refetchLocations() throws a 
> BlockMissingException). 
>  
> Example stack trace
> {code:java}
> [12]:23/11/30 23:01:33 WARN hdfs.DFSClient: Connection failure: 
> Failed to connect to 10.150.91.13/10.150.91.13:71 for file 
> //part--95b9909c-zzz-c000.avro for block 
> BP-364971551-DatanodeIP-1448516588954:blk__129864739321:java.net.SocketTimeoutException:
>  6 millis timeout while waiting for channel to be ready for read. ch : 
> java.nio.channels.SocketChannel[connected local=/ip:40492 
> remote=datanodeIP:71] [12]:java.net.SocketTimeoutException: 6 
> millis timeout while waiting for channel to be ready for read. ch : 
> java.nio.channels.SocketChannel[connected local=/localIp:40492 
> remote=datanodeIP:71] [12]: at 
> org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164) 
> [12]: at 
> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161) 
> [12]: at 
> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131) 
> [12]: at 
> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:118) 
> [12]: at java.io.FilterInputStream.read(FilterInputStream.java:83) 
> [12]: at 
> org.apache.hadoop.hdfs.protocolPB.PBHelperClient.vintPrefixed(PBHelperClient.java:458)
>  [12]: at 
> org.apache.hadoop.hdfs.client.impl.BlockReaderRemote2.newBlockReader(BlockReaderRemote2.java:412)
>  [12]: at 
> org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.getRemoteBlockReader(BlockReaderFactory.java:864)
>  [12]: at 
> org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:753)
>  [12]: at 
> org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.build(BlockReaderFactory.java:387)
>  [12]: at 
> org.apache.hadoop.hdfs.DFSInputStream.getBlockReader(DFSInputStream.java:736) 
> [12]: at 
> org.apache.hadoop.hdfs.DFSInputStream.actualGetFromOneDataNode(DFSInputStream.java:1268)
>  [12]: at 
> org.apache.hadoop.hdfs.DFSInputStream.fetchBlockByteRange(DFSInputStream.java:1216)
>  [12]: at 
> org.apache.hadoop.hdfs.DFSInputStream.pread(DFSInputStream.java:1608) 
> [12]: at 
> org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:1568) 
> [12]: at 
> org.apache.hadoop.fs.FSDataInputStream.read(FSDataInputStream.java:93) 
> [12]: at 
> hdfs_metrics_shade.org.apache.hadoop.fs.InstrumentedFSDataInputStream$InstrumentedFilterInputStream.lambda$read$0(InstrumentedFSDataInputStream.java:108)
>  [12]: at 
> com.linkedin.hadoop.metrics.fs.PerformanceTrackingFSDataInputStream.process(PerformanceTrackingFSDataInputStream.java:39)
>  [12]: at 
> hdfs_metrics_shade.org.apache.hadoop.fs.InstrumentedFSDataInputStream$InstrumentedFilterInputStream.read(InstrumentedFSDataInputStream.java:108)
>  [12]: at 
> org.apache.hadoop.fs.FSDataInputStream.read(FSDataInputStream.java:93) 
> [12]: at 
> org.apache.hadoop.fs.RetryingInputStream.lambda$read$2(RetryingInputStream.java:153)
>  [12]: at 
> org.apache.hadoop.fs.NoOpRetryPolicy.run(NoOpRetryPolicy.java:36) 
> [12]: at 
> org.apache.hadoop.fs.RetryingInputStream.read(RetryingInputStream.java:149) 
> [12]: at 
> org.apache.hadoop.fs.FSDataInputStream.read(FSDataInputStream.java:93){code}






[jira] [Commented] (HDFS-17302) RBF: ProportionRouterRpcFairnessPolicyController-Sharing and isolation.

2024-01-19 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17808811#comment-17808811
 ] 

ASF GitHub Bot commented on HDFS-17302:
---

goiri merged PR #6380:
URL: https://github.com/apache/hadoop/pull/6380




> RBF: ProportionRouterRpcFairnessPolicyController-Sharing and isolation.
> ---
>
> Key: HDFS-17302
> URL: https://issues.apache.org/jira/browse/HDFS-17302
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: rbf
>Reporter: Jian Zhang
>Assignee: Jian Zhang
>Priority: Major
>  Labels: pull-request-available
> Attachments: HDFS-17302.001.patch, HDFS-17302.002.patch, 
> HDFS-17302.003.patch
>
>
> h2. Current shortcomings
> [HDFS-14090|https://issues.apache.org/jira/browse/HDFS-14090] provides a 
> StaticRouterRpcFairnessPolicyController to support configuring different 
> handlers for different ns. Using the StaticRouterRpcFairnessPolicyController 
> allows the router to isolate different ns, and the ns with a higher load will 
> not affect the router's access to the ns with a normal load. But the 
> StaticRouterRpcFairnessPolicyController still falls short in many ways, such 
> as:
> 1. *Configuration is inconvenient and error-prone*: When I use 
> StaticRouterRpcFairnessPolicyController, I first need to know how many 
> handlers the router has in total, then I have to know how many nameservices 
> the router currently has, and then carefully calculate how many handlers to 
> allocate to each ns so that the sum of handlers for all ns will not exceed 
> the total handlers of the router, and I also need to consider how many 
> handlers to allocate to each ns to achieve better performance. Therefore, I 
> need to be very careful when configuring. Even if I configure only one more 
> handler for a certain ns, the total number is more than the number of 
> handlers owned by the router, which will also cause the router to fail to 
> start. At this time, I had to investigate the reason why the router failed to 
> start. After finding the reason, I had to reconsider the number of handlers 
> for each ns. In addition, when I reconfigure the total number of handlers on 
> the router, I have to re-allocate handlers to each ns, which undoubtedly 
> increases the complexity of operation and maintenance.
> 2. *Extension ns is not supported*: During the running of the router, if a 
> new ns is added to the cluster and a mount is added for the ns, but because 
> no handler is allocated for the ns, the ns cannot be accessed through the 
> router. We must reconfigure the number of handlers and then refresh the 
> configuration. At this time, the router can access the ns normally. When we 
> reconfigure the number of handlers, we have to face disadvantage 1: 
> Configuration is inconvenient and error-prone.
> 3. *Wasted handlers*: The main purpose of proposing 
> RouterRpcFairnessPolicyController is to enable the router to access ns with 
> normal load and not be affected by ns with higher load. First of all, not all 
> ns have high loads; secondly, ns with high loads do not have high loads 24 
> hours a day. It may be that only certain time periods, such as 0 to 8 
> o'clock, have high loads, and other time periods have normal loads. Assume 
> there are 2 ns, and each ns is allocated half of the number of handlers. 
> Assume that ns1 has many requests from 0 to 14 o'clock, and almost no 
> requests from 14 to 24 o'clock, ns2 has many requests from 12 to 24 o'clock, 
> and almost no requests from 0 to 14 o'clock; when it is between 0 o'clock and 
> 12 o'clock and between 14 o'clock and 24 o'clock, only one ns has more 
> requests and the other ns has almost no requests, so we have wasted half of 
> the number of handlers.
> 4. *Only isolation, no sharing*: The staticRouterRpcFairnessPolicyController 
> does not support sharing, only isolation. I think isolation is just a means 
> to improve the performance of router access to normal ns, not the purpose. It 
> is impossible for all ns in the cluster to have high loads. On the contrary, 
> in most scenarios, only a few ns in the cluster have high loads, and the 
> loads of most other ns are normal. For ns with higher load and ns with normal 
> load, we need to isolate their handlers so that the ns with higher load will 
> not affect the performance of ns with lower load. However, for nameservices 
> that are also under normal load, or are under higher load, we do not need to 
> isolate them, these ns of the same nature can share the handlers of the 
> router; The performance is better than assigning a fixed number of handlers 
> to each ns, because each ns can use all the handlers of the router.
> h2. New features
> Based on the above staticRouterRpcFairnessPol
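
As background for the comparison above, a configuration sketch of the static controller from HDFS-14090 (the controller class and per-ns handler-count keys below are the static controller's; the exact keys for the new proportion-based controller are not shown in this thread):

{code:java}
import org.apache.hadoop.conf.Configuration;

// Static controller: per-ns handler counts are fixed and must not exceed the
// Router's total handler count, or the Router fails to start.
Configuration conf = new Configuration();
conf.set("dfs.federation.router.fairness.policy.controller.class",
    "org.apache.hadoop.hdfs.server.federation.fairness."
        + "StaticRouterRpcFairnessPolicyController");
conf.setInt("dfs.federation.router.fairness.handler.count.ns1", 32);
conf.setInt("dfs.federation.router.fairness.handler.count.ns2", 32);
// Adding a new ns3 later requires re-balancing these counts by hand; the
// proportion-based controller described here is meant to avoid that.
{code}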

[jira] [Updated] (HDFS-17302) RBF: ProportionRouterRpcFairnessPolicyController-Sharing and isolation.

2024-01-19 Thread Jira


 [ 
https://issues.apache.org/jira/browse/HDFS-17302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Íñigo Goiri updated HDFS-17302:
---
Fix Version/s: 3.5.0
 Hadoop Flags: Reviewed
   Resolution: Fixed
   Status: Resolved  (was: Patch Available)

> RBF: ProportionRouterRpcFairnessPolicyController-Sharing and isolation.
> ---
>
> Key: HDFS-17302
> URL: https://issues.apache.org/jira/browse/HDFS-17302
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: rbf
>Reporter: Jian Zhang
>Assignee: Jian Zhang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.5.0
>
> Attachments: HDFS-17302.001.patch, HDFS-17302.002.patch, 
> HDFS-17302.003.patch
>
>
> h2. Current shortcomings
> [HDFS-14090|https://issues.apache.org/jira/browse/HDFS-14090] provides a 
> StaticRouterRpcFairnessPolicyController to support configuring different 
> handlers for different ns. Using the StaticRouterRpcFairnessPolicyController 
> allows the router to isolate different ns, and the ns with a higher load will 
> not affect the router's access to the ns with a normal load. But the 
> StaticRouterRpcFairnessPolicyController still falls short in many ways, such 
> as:
> 1. *Configuration is inconvenient and error-prone*: When I use 
> StaticRouterRpcFairnessPolicyController, I first need to know how many 
> handlers the router has in total, then I have to know how many nameservices 
> the router currently has, and then carefully calculate how many handlers to 
> allocate to each ns so that the sum of handlers for all ns will not exceed 
> the total handlers of the router, and I also need to consider how many 
> handlers to allocate to each ns to achieve better performance. Therefore, I 
> need to be very careful when configuring. Even if I configure only one more 
> handler for a certain ns, the total number is more than the number of 
> handlers owned by the router, which will also cause the router to fail to 
> start. At this time, I had to investigate the reason why the router failed to 
> start. After finding the reason, I had to reconsider the number of handlers 
> for each ns. In addition, when I reconfigure the total number of handlers on 
> the router, I have to re-allocate handlers to each ns, which undoubtedly 
> increases the complexity of operation and maintenance.
> 2. *Extension ns is not supported*: During the running of the router, if a 
> new ns is added to the cluster and a mount is added for the ns, but because 
> no handler is allocated for the ns, the ns cannot be accessed through the 
> router. We must reconfigure the number of handlers and then refresh the 
> configuration. At this time, the router can access the ns normally. When we 
> reconfigure the number of handlers, we have to face disadvantage 1: 
> Configuration is inconvenient and error-prone.
> 3. *Wasted handlers*: The main purpose of proposing 
> RouterRpcFairnessPolicyController is to enable the router to access ns with 
> normal load and not be affected by ns with higher load. First of all, not all 
> ns have high loads; secondly, ns with high loads do not have high loads 24 
> hours a day. It may be that only certain time periods, such as 0 to 8 
> o'clock, have high loads, and other time periods have normal loads. Assume 
> there are 2 ns, and each ns is allocated half of the number of handlers. 
> Assume that ns1 has many requests from 0 to 14 o'clock, and almost no 
> requests from 14 to 24 o'clock, ns2 has many requests from 12 to 24 o'clock, 
> and almost no requests from 0 to 14 o'clock; when it is between 0 o'clock and 
> 12 o'clock and between 14 o'clock and 24 o'clock, only one ns has more 
> requests and the other ns has almost no requests, so we have wasted half of 
> the number of handlers.
> 4. *Only isolation, no sharing*: The staticRouterRpcFairnessPolicyController 
> does not support sharing, only isolation. I think isolation is just a means 
> to improve the performance of router access to normal ns, not the purpose. It 
> is impossible for all ns in the cluster to have high loads. On the contrary, 
> in most scenarios, only a few ns in the cluster have high loads, and the 
> loads of most other ns are normal. For ns with higher load and ns with normal 
> load, we need to isolate their handlers so that the ns with higher load will 
> not affect the performance of ns with lower load. However, for nameservices 
> that are also under normal load, or are under higher load, we do not need to 
> isolate them, these ns of the same nature can share the handlers of the 
> router; The performance is better than assigning a fixed number of handlers 
> to each ns, because each ns can use all the handlers of the router.
> h2. New features
> Based on the above sta

[jira] [Resolved] (HDFS-17343) Revert HDFS-16016. BPServiceActor to provide new thread to handle IBR

2024-01-19 Thread Shilun Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shilun Fan resolved HDFS-17343.
---
Fix Version/s: 3.4.0
 Hadoop Flags: Reviewed
   Resolution: Fixed

> Revert HDFS-16016. BPServiceActor to provide new thread to handle IBR
> -
>
> Key: HDFS-17343
> URL: https://issues.apache.org/jira/browse/HDFS-17343
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Affects Versions: 3.4.0
>Reporter: Shilun Fan
>Assignee: Shilun Fan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>
> When preparing for the hadoop-3.4.0 release, we found that HDFS-16016 may 
> cause incremental block reports (IBRs) and full block reports (FBRs) to be 
> sent out of order on the DataNode. After discussion, we decided to revert 
> HDFS-16016.






[jira] [Commented] (HDFS-17343) Revert HDFS-16016. BPServiceActor to provide new thread to handle IBR

2024-01-19 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17808834#comment-17808834
 ] 

ASF GitHub Bot commented on HDFS-17343:
---

slfan1989 merged PR #6457:
URL: https://github.com/apache/hadoop/pull/6457




> Revert HDFS-16016. BPServiceActor to provide new thread to handle IBR
> -
>
> Key: HDFS-17343
> URL: https://issues.apache.org/jira/browse/HDFS-17343
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Affects Versions: 3.4.0
>Reporter: Shilun Fan
>Assignee: Shilun Fan
>Priority: Major
>  Labels: pull-request-available
>
> When preparing for the hadoop-3.4.0 release, we found that HDFS-16016 may 
> cause incremental block reports (IBRs) and full block reports (FBRs) to be 
> sent out of order on the DataNode. After discussion, we decided to revert 
> HDFS-16016.






[jira] [Commented] (HDFS-17343) Revert HDFS-16016. BPServiceActor to provide new thread to handle IBR

2024-01-19 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17808835#comment-17808835
 ] 

ASF GitHub Bot commented on HDFS-17343:
---

slfan1989 commented on PR #6457:
URL: https://github.com/apache/hadoop/pull/6457#issuecomment-1901335747

   @tasanuma @Hexiaoqiao @ayushtkn @virajjasani Thanks for reviewing the code! 
Merged into trunk.




> Revert HDFS-16016. BPServiceActor to provide new thread to handle IBR
> -
>
> Key: HDFS-17343
> URL: https://issues.apache.org/jira/browse/HDFS-17343
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Affects Versions: 3.4.0
>Reporter: Shilun Fan
>Assignee: Shilun Fan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>
> When preparing for the hadoop-3.4.0 release, we found that HDFS-16016 may 
> cause incremental block reports (IBRs) and full block reports (FBRs) to be 
> sent out of order on the DataNode. After discussion, we decided to revert 
> HDFS-16016.






[jira] [Commented] (HDFS-17311) RBF: ConnectionManager creatorQueue should offer a pool that is not already in creatorQueue.

2024-01-19 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17808836#comment-17808836
 ] 

ASF GitHub Bot commented on HDFS-17311:
---

slfan1989 merged PR #6392:
URL: https://github.com/apache/hadoop/pull/6392




> RBF: ConnectionManager creatorQueue should offer a pool that is not already 
> in creatorQueue.
> 
>
> Key: HDFS-17311
> URL: https://issues.apache.org/jira/browse/HDFS-17311
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: liuguanghua
>Assignee: liuguanghua
>Priority: Major
>  Labels: pull-request-available
>
> In the Router, we found the below log:
>  
> 2023-12-29 15:18:54,799 ERROR 
> org.apache.hadoop.hdfs.server.federation.router.ConnectionManager: Cannot add 
> more than 2048 connections at the same time
>  
> The log indicates that ConnectionManager.creatorQueue is full at a certain 
> point. But my cluster does not have so many users that it could reach 2048 
> pairs of 
> .
> This may be due to the following reasons:
>  # ConnectionManager.creatorQueue is a queue that a ConnectionPool is offered 
> to when it does not have enough ConnectionContexts.
>  # The ConnectionCreator thread consumes from creatorQueue and creates more 
> ConnectionContexts for a ConnectionPool.
>  # Clients may concurrently invoke ConnectionManager.getConnection() for the 
> same user, and this may add many copies of the same ConnectionPool to 
> ConnectionManager.creatorQueue.
>  # When creatorQueue is full, a new ConnectionPool will not be added 
> successfully and this error is logged. As a result, a genuinely new 
> ConnectionPool cannot produce more ConnectionContexts for a new user.
> So this PR tries to make creatorQueue not hold the same ConnectionPool twice.
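
A minimal sketch of the dedup idea (illustrative only; the actual change in PR #6392 may differ):

{code:java}
// Sketch: skip offering a ConnectionPool that is already queued for creation,
// so duplicate offers for the same user cannot fill up creatorQueue.
// 'creatorQueue' mirrors ConnectionManager's BlockingQueue<ConnectionPool>.
if (!creatorQueue.contains(pool)) {
  if (!creatorQueue.offer(pool)) {
    LOG.error("Cannot add more than {} connections at the same time",
        MAX_NEW_CONNECTIONS);
  }
}
{code}

The contains-then-offer pair is not atomic, but losing that race only means an occasional duplicate offer, which is harmless compared to the queue filling up with duplicates.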






[jira] [Commented] (HDFS-17311) RBF: ConnectionManager creatorQueue should offer a pool that is not already in creatorQueue.

2024-01-19 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17808838#comment-17808838
 ] 

ASF GitHub Bot commented on HDFS-17311:
---

slfan1989 commented on PR #6392:
URL: https://github.com/apache/hadoop/pull/6392#issuecomment-1901344810

   @LiuGuH Thank you for your contribution! Merged into trunk. @goiri Thank you 
for reviewing the code!




> RBF: ConnectionManager creatorQueue should offer a pool that is not already 
> in creatorQueue.
> 
>
> Key: HDFS-17311
> URL: https://issues.apache.org/jira/browse/HDFS-17311
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: liuguanghua
>Assignee: liuguanghua
>Priority: Major
>  Labels: pull-request-available
>
> In the Router, we found the below log:
>  
> 2023-12-29 15:18:54,799 ERROR 
> org.apache.hadoop.hdfs.server.federation.router.ConnectionManager: Cannot add 
> more than 2048 connections at the same time
>  
> The log indicates that ConnectionManager.creatorQueue is full at a certain 
> point. But my cluster does not have so many users that it could reach 2048 
> pairs of 
> .
> This may be due to the following reasons:
>  # ConnectionManager.creatorQueue is a queue that a ConnectionPool is offered 
> to when it does not have enough ConnectionContexts.
>  # The ConnectionCreator thread consumes from creatorQueue and creates more 
> ConnectionContexts for a ConnectionPool.
>  # Clients may concurrently invoke ConnectionManager.getConnection() for the 
> same user, and this may add many copies of the same ConnectionPool to 
> ConnectionManager.creatorQueue.
>  # When creatorQueue is full, a new ConnectionPool will not be added 
> successfully and this error is logged. As a result, a genuinely new 
> ConnectionPool cannot produce more ConnectionContexts for a new user.
> So this PR tries to make creatorQueue not hold the same ConnectionPool twice.






[jira] [Resolved] (HDFS-17311) RBF: ConnectionManager creatorQueue should offer a pool that is not already in creatorQueue.

2024-01-19 Thread Shilun Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shilun Fan resolved HDFS-17311.
---
   Fix Version/s: 3.5.0
Hadoop Flags: Reviewed
Target Version/s: 3.5.0
  Resolution: Fixed

> RBF: ConnectionManager creatorQueue should offer a pool that is not already 
> in creatorQueue.
> 
>
> Key: HDFS-17311
> URL: https://issues.apache.org/jira/browse/HDFS-17311
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: liuguanghua
>Assignee: liuguanghua
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.5.0
>
>
> In the Router, we found the below log:
>  
> 2023-12-29 15:18:54,799 ERROR 
> org.apache.hadoop.hdfs.server.federation.router.ConnectionManager: Cannot add 
> more than 2048 connections at the same time
>  
> The log indicates that ConnectionManager.creatorQueue is full at a certain 
> point. But my cluster does not have so many users that it could reach 2048 
> pairs of 
> .
> This may be due to the following reasons:
>  # ConnectionManager.creatorQueue is a queue that a ConnectionPool is offered 
> to when it does not have enough ConnectionContexts.
>  # The ConnectionCreator thread consumes from creatorQueue and creates more 
> ConnectionContexts for a ConnectionPool.
>  # Clients may concurrently invoke ConnectionManager.getConnection() for the 
> same user, and this may add many copies of the same ConnectionPool to 
> ConnectionManager.creatorQueue.
>  # When creatorQueue is full, a new ConnectionPool will not be added 
> successfully and this error is logged. As a result, a genuinely new 
> ConnectionPool cannot produce more ConnectionContexts for a new user.
> So this PR tries to make creatorQueue not hold the same ConnectionPool twice.






[jira] [Commented] (HDFS-17311) RBF: ConnectionManager creatorQueue should offer a pool that is not already in creatorQueue.

2024-01-19 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17808839#comment-17808839
 ] 

ASF GitHub Bot commented on HDFS-17311:
---

slfan1989 commented on PR #6392:
URL: https://github.com/apache/hadoop/pull/6392#issuecomment-1901346360

   > > @LiuGuH Thanks for the contribution! Can we trigger compilation again?
   > 
   > Thanks for the review. The compilation is now triggered. I triggered it 
with the command "git commit --amend && git push -f". Is there any other way 
to trigger compilation? Thanks
   
   We can also trigger it via the "Update branch" button on the PR page. 
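A force-push-free alternative (an editorial note, assuming the CI re-runs on
every new commit) is to push an empty commit:

    # Creates a no-op commit so CI re-runs without rewriting history.
    git commit --allow-empty -m "Trigger CI" && git push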




> RBF: ConnectionManager creatorQueue should offer a pool that is not already 
> in creatorQueue.
> 
>
> Key: HDFS-17311
> URL: https://issues.apache.org/jira/browse/HDFS-17311
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: liuguanghua
>Assignee: liuguanghua
>Priority: Major
>  Labels: pull-request-available
>
> In the Router, we found the log below:
>  
> 2023-12-29 15:18:54,799 ERROR 
> org.apache.hadoop.hdfs.server.federation.router.ConnectionManager: Cannot add 
> more than 2048 connections at the same time
>  
> The log indicates that ConnectionManager.creatorQueue is full at a certain 
> point, but my cluster does not have enough users to reach 2048 
> <user, namenode> pairs.
> This may be due to the following reasons:
>  # ConnectionManager.creatorQueue is a queue to which a ConnectionPool is 
> offered when it does not have enough ConnectionContexts.
>  # The ConnectionCreator thread consumes from creatorQueue and creates more 
> ConnectionContexts for a ConnectionPool.
>  # Clients may concurrently invoke ConnectionManager.getConnection() for the 
> same user, which can add many identical ConnectionPool entries to 
> ConnectionManager.creatorQueue.
>  # When creatorQueue is full, a new ConnectionPool cannot be added and this 
> error is logged, so a genuinely new ConnectionPool may be unable to produce 
> more ConnectionContexts for a new user.
> So this PR ensures that creatorQueue does not enqueue the same ConnectionPool 
> more than once.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-17311) RBF: ConnectionManager creatorQueue should offer a pool that is not already in creatorQueue.

2024-01-19 Thread Shilun Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shilun Fan updated HDFS-17311:
--
Affects Version/s: 3.5.0

> RBF: ConnectionManager creatorQueue should offer a pool that is not already 
> in creatorQueue.
> 
>
> Key: HDFS-17311
> URL: https://issues.apache.org/jira/browse/HDFS-17311
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Affects Versions: 3.5.0
>Reporter: liuguanghua
>Assignee: liuguanghua
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.5.0
>
>
> In the Router, we found the log below:
>  
> 2023-12-29 15:18:54,799 ERROR 
> org.apache.hadoop.hdfs.server.federation.router.ConnectionManager: Cannot add 
> more than 2048 connections at the same time
>  
> The log indicates that ConnectionManager.creatorQueue is full at a certain 
> point, but my cluster does not have enough users to reach 2048 
> <user, namenode> pairs.
> This may be due to the following reasons:
>  # ConnectionManager.creatorQueue is a queue to which a ConnectionPool is 
> offered when it does not have enough ConnectionContexts.
>  # The ConnectionCreator thread consumes from creatorQueue and creates more 
> ConnectionContexts for a ConnectionPool.
>  # Clients may concurrently invoke ConnectionManager.getConnection() for the 
> same user, which can add many identical ConnectionPool entries to 
> ConnectionManager.creatorQueue.
>  # When creatorQueue is full, a new ConnectionPool cannot be added and this 
> error is logged, so a genuinely new ConnectionPool may be unable to produce 
> more ConnectionContexts for a new user.
> So this PR ensures that creatorQueue does not enqueue the same ConnectionPool 
> more than once.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-10224) Implement asynchronous rename for DistributedFileSystem

2024-01-19 Thread farmmamba (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-10224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17808845#comment-17808845
 ] 

farmmamba commented on HDFS-10224:
--

[~szetszwo] Sir, Thanks a lot ~

> Implement asynchronous rename for DistributedFileSystem
> ---
>
> Key: HDFS-10224
> URL: https://issues.apache.org/jira/browse/HDFS-10224
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: fs, hdfs-client
>Reporter: Xiaobing Zhou
>Assignee: Xiaobing Zhou
>Priority: Major
> Fix For: 2.8.0, 3.0.0-alpha1
>
> Attachments: HDFS-10224-HDFS-9924.000.patch, 
> HDFS-10224-HDFS-9924.001.patch, HDFS-10224-HDFS-9924.002.patch, 
> HDFS-10224-HDFS-9924.003.patch, HDFS-10224-HDFS-9924.004.patch, 
> HDFS-10224-HDFS-9924.005.patch, HDFS-10224-HDFS-9924.006.patch, 
> HDFS-10224-HDFS-9924.007.patch, HDFS-10224-HDFS-9924.008.patch, 
> HDFS-10224-HDFS-9924.009.patch, HDFS-10224-and-HADOOP-12909.000.patch, 
> image-2024-01-19-23-06-32-901.png
>
>
> This is proposed to implement an asynchronous rename.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-17342) Fix DataNode may invalidates normal block causing missing block

2024-01-19 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17808860#comment-17808860
 ] 

ASF GitHub Bot commented on HDFS-17342:
---

haiyang1987 commented on code in PR #6464:
URL: https://github.com/apache/hadoop/pull/6464#discussion_r1460160197


##
hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/impl/TestFsDatasetImpl.java:
##
@@ -2011,4 +2011,83 @@ public void tesInvalidateMissingBlock() throws Exception 
{
   cluster.shutdown();
 }
   }
+
+  @Test
+  public void testCheckFilesWhenInvalidateMissingBlock() throws Exception {
+long blockSize = 1024;
+int heartbeatInterval = 1;
+HdfsConfiguration c = new HdfsConfiguration();
+c.setInt(DFSConfigKeys.DFS_HEARTBEAT_INTERVAL_KEY, heartbeatInterval);
+c.setLong(DFS_BLOCK_SIZE_KEY, blockSize);
+MiniDFSCluster cluster = new MiniDFSCluster.Builder(c).
+numDataNodes(1).build();
+DataNodeFaultInjector oldDnInjector = DataNodeFaultInjector.get();
+try {
+  cluster.waitActive();
+  BlockReaderTestUtil util = new BlockReaderTestUtil(cluster, new
+  HdfsConfiguration(conf));
+  Path path = new Path("/testFile");
+  util.writeFile(path, 1);
+  String bpid = cluster.getNameNode().getNamesystem().getBlockPoolId();
+  DataNode dn = cluster.getDataNodes().get(0);
+  FsDatasetImpl dnFSDataset = (FsDatasetImpl) dn.getFSDataset();
+  List<ReplicaInfo> replicaInfos = dnFSDataset.getFinalizedBlocks(bpid);
+  assertEquals(1, replicaInfos.size());
+  DFSTestUtil.readFile(cluster.getFileSystem(), path);
+  LocatedBlock blk = util.getFileBlocks(path, 512).get(0);
+  ExtendedBlock block = blk.getBlock();
+
+  // Append a new block with an incremented generation stamp.
+  long newGS = block.getGenerationStamp() + 1;
+  dnFSDataset.append(block, newGS, 1024);
+  block.setGenerationStamp(newGS);
+
+  DataNodeFaultInjector injector = new DataNodeFaultInjector() {
+@Override
+public void delayGetMetaDataInputStream() {
+  try {
+Thread.sleep(8000);
+  } catch (InterruptedException e) {
+// Ignore exception.
+  }
+}
+  };
+  // Delay to getMetaDataInputStream.
+  DataNodeFaultInjector.set(injector);
+
+  ExecutorService executorService = Executors.newFixedThreadPool(2);
+  try {
+Future<?> blockReaderFuture = executorService.submit(() -> {
+  try {
+// Submit tasks for reading block.
+BlockReaderTestUtil.getBlockReader(cluster.getFileSystem(), blk, 
0, 512);

Review Comment:
   Thanks @smarthanwang for your review.





> Fix DataNode may invalidates normal block causing missing block
> ---
>
> Key: HDFS-17342
> URL: https://issues.apache.org/jira/browse/HDFS-17342
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Reporter: Haiyang Hu
>Assignee: Haiyang Hu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.5.0
>
>
> When users read a file that is being appended, occasional exceptions may 
> occur, such as org.apache.hadoop.hdfs.BlockMissingException: Could not 
> obtain block: xxx.
> This can happen if one thread is reading the block while the writer thread 
> is finalizing it simultaneously.
> *Root cause:*
> # The reader thread obtains an RBW replica from the VolumeMap, such as 
> blk_xxx_xxx[RBW], whose data file is in /XXX/rbw/blk_xxx.
> # Simultaneously, the writer thread finalizes this block, moving it from 
> the RBW directory to the FINALIZED directory: the data file is moved from 
> /XXX/rbw/blk_xxx to /XXX/finalized/blk_xxx.
> # The reader thread attempts to open the data input stream but encounters a 
> FileNotFoundException, because the data file /XXX/rbw/blk_xxx or meta file 
> /XXX/rbw/blk_xxx_xxx no longer exists at that moment.
> # The reader thread then treats this block as corrupt and removes the 
> replica from the volume map, and the DataNode reports the deleted block to 
> the NameNode.
> # The NameNode removes this replica of the block.
> # If the file's replication is 1, this causes a missing-block issue until 
> the DataNode runs the DirectoryScanner again.
> As described above, the FileNotFoundException encountered by the reader 
> thread is expected, because the file has been moved.
> So we need to add a double check to the invalidateMissingBlock logic that 
> verifies whether the data file or meta file exists, to avoid similar cases; 
> a minimal sketch follows below.
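A hedged sketch of that double check (an editorial sketch, not the actual
patch; the helper names blockDataExists() and metadataExists() are assumed
from org.apache.hadoop.hdfs.server.datanode.ReplicaInfo):

    // Assumed minimal view of a replica's on-disk files.
    interface ReplicaFiles {
      boolean blockDataExists();  // is the data file present on disk?
      boolean metadataExists();   // is the meta file present on disk?
    }

    // Only treat the replica as missing when its files are really gone.
    // A FileNotFoundException during a finalize-vs-read race is expected,
    // since the files have just moved from rbw/ to finalized/.
    static boolean shouldInvalidateMissingBlock(ReplicaFiles replica) {
      return !(replica.blockDataExists() && replica.metadataExists());
    }

With such a check in place, the DataNode would skip the invalidation (and the
deletion report to the NameNode) when the block was merely moved by
finalization.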



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org

[jira] [Commented] (HDFS-17342) Fix DataNode may invalidates normal block causing missing block

2024-01-19 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17808861#comment-17808861
 ] 

ASF GitHub Bot commented on HDFS-17342:
---

haiyang1987 commented on code in PR #6464:
URL: https://github.com/apache/hadoop/pull/6464#discussion_r1460160197


##
hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/impl/TestFsDatasetImpl.java:
##
@@ -2011,4 +2011,83 @@ public void tesInvalidateMissingBlock() throws Exception 
{
   cluster.shutdown();
 }
   }
+
+  @Test
+  public void testCheckFilesWhenInvalidateMissingBlock() throws Exception {
+long blockSize = 1024;
+int heartbeatInterval = 1;
+HdfsConfiguration c = new HdfsConfiguration();
+c.setInt(DFSConfigKeys.DFS_HEARTBEAT_INTERVAL_KEY, heartbeatInterval);
+c.setLong(DFS_BLOCK_SIZE_KEY, blockSize);
+MiniDFSCluster cluster = new MiniDFSCluster.Builder(c).
+numDataNodes(1).build();
+DataNodeFaultInjector oldDnInjector = DataNodeFaultInjector.get();
+try {
+  cluster.waitActive();
+  BlockReaderTestUtil util = new BlockReaderTestUtil(cluster, new
+  HdfsConfiguration(conf));
+  Path path = new Path("/testFile");
+  util.writeFile(path, 1);
+  String bpid = cluster.getNameNode().getNamesystem().getBlockPoolId();
+  DataNode dn = cluster.getDataNodes().get(0);
+  FsDatasetImpl dnFSDataset = (FsDatasetImpl) dn.getFSDataset();
+  List<ReplicaInfo> replicaInfos = dnFSDataset.getFinalizedBlocks(bpid);
+  assertEquals(1, replicaInfos.size());
+  DFSTestUtil.readFile(cluster.getFileSystem(), path);
+  LocatedBlock blk = util.getFileBlocks(path, 512).get(0);
+  ExtendedBlock block = blk.getBlock();
+
+  // Append a new block with an incremented generation stamp.
+  long newGS = block.getGenerationStamp() + 1;
+  dnFSDataset.append(block, newGS, 1024);
+  block.setGenerationStamp(newGS);
+
+  DataNodeFaultInjector injector = new DataNodeFaultInjector() {
+@Override
+public void delayGetMetaDataInputStream() {
+  try {
+Thread.sleep(8000);
+  } catch (InterruptedException e) {
+// Ignore exception.
+  }
+}
+  };
+  // Delay to getMetaDataInputStream.
+  DataNodeFaultInjector.set(injector);
+
+  ExecutorService executorService = Executors.newFixedThreadPool(2);
+  try {
+Future<?> blockReaderFuture = executorService.submit(() -> {
+  try {
+// Submit tasks for reading block.
+BlockReaderTestUtil.getBlockReader(cluster.getFileSystem(), blk, 
0, 512);

Review Comment:
   Thanks @smarthanwang for your review.  
   `java.io.FileNotFoundException` will not be thrown here, because the 
exception is caught and written to the exception stack while the 
`DataXceiver` processes the read.
   
   If we want to assert on `FileNotFoundException`, we may need to construct 
the `BlockSender` directly; a hedged sketch follows below.
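A hedged sketch of that idea (an editorial sketch; the BlockSender
constructor arguments and the use of LambdaTestUtils are assumptions from
trunk and may not match the PR's code):

    // Build the BlockSender directly so the FileNotFoundException
    // surfaces to the test instead of being swallowed by DataXceiver.
    LambdaTestUtils.intercept(FileNotFoundException.class, () ->
        new BlockSender(block, 0, 512, false, true, true,
            dn, null, CachingStrategy.newDefaultStrategy()));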



##
hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/impl/TestFsDatasetImpl.java:
##
@@ -2011,4 +2011,83 @@ public void tesInvalidateMissingBlock() throws Exception 
{
   cluster.shutdown();
 }
   }
+
+  @Test
+  public void testCheckFilesWhenInvalidateMissingBlock() throws Exception {
+long blockSize = 1024;
+int heartbeatInterval = 1;
+HdfsConfiguration c = new HdfsConfiguration();
+c.setInt(DFSConfigKeys.DFS_HEARTBEAT_INTERVAL_KEY, heartbeatInterval);
+c.setLong(DFS_BLOCK_SIZE_KEY, blockSize);
+MiniDFSCluster cluster = new MiniDFSCluster.Builder(c).
+numDataNodes(1).build();
+DataNodeFaultInjector oldDnInjector = DataNodeFaultInjector.get();
+try {
+  cluster.waitActive();
+  BlockReaderTestUtil util = new BlockReaderTestUtil(cluster, new
+  HdfsConfiguration(conf));
+  Path path = new Path("/testFile");
+  util.writeFile(path, 1);
+  String bpid = cluster.getNameNode().getNamesystem().getBlockPoolId();
+  DataNode dn = cluster.getDataNodes().get(0);
+  FsDatasetImpl dnFSDataset = (FsDatasetImpl) dn.getFSDataset();
+  List<ReplicaInfo> replicaInfos = dnFSDataset.getFinalizedBlocks(bpid);
+  assertEquals(1, replicaInfos.size());
+  DFSTestUtil.readFile(cluster.getFileSystem(), path);
+  LocatedBlock blk = util.getFileBlocks(path, 512).get(0);
+  ExtendedBlock block = blk.getBlock();
+
+  // Append a new block with an incremented generation stamp.
+  long newGS = block.getGenerationStamp() + 1;
+  dnFSDataset.append(block, newGS, 1024);
+  block.setGenerationStamp(newGS);
+
+  DataNodeFaultInjector injector = new DataNodeFaultInjector() {
+@Override
+public void delayGetMetaDataInputStream() {
+  try {
+Thread.sleep(8000);
+  } catch (InterruptedException e) {
+