[jira] [Updated] (HDFS-17342) Fix DataNode may invalidates normal block causing missing block

2024-01-18 Thread Haiyang Hu (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haiyang Hu updated HDFS-17342:
--
Component/s: datanode

> Fix DataNode may invalidates normal block causing missing block
> ---
>
> Key: HDFS-17342
> URL: https://issues.apache.org/jira/browse/HDFS-17342
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Reporter: Haiyang Hu
>Assignee: Haiyang Hu
>Priority: Major
>  Labels: pull-request-available
>
> When users read an append file, occasional exceptions may occur, such as 
> org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block: xxx.
> This can happen if one thread is reading the block while the writer thread is 
> finalizing it simultaneously.
> *Root cause:*
> # The reader thread obtains an RBW replica from the VolumeMap, such as 
> blk_xxx_xxx[RBW], whose data file should be in /XXX/rbw/blk_xxx.
> # Simultaneously, the writer thread finalizes this block, moving it from 
> the RBW directory to the FINALIZED directory: the data file moves from 
> /XXX/rbw/blk_xxx to /XXX/finalized/blk_xxx.
> # The reader thread attempts to open the data input stream but encounters a 
> FileNotFoundException, because the data file /XXX/rbw/blk_xxx or meta file 
> /XXX/rbw/blk_xxx_xxx no longer exists at this moment.
> # The reader thread treats this block as corrupt, removes the replica 
> from the volume map, and the DataNode reports the deleted block to the 
> NameNode.
> # The NameNode removes this replica of the block.
> # If the file's replication is 1, this causes a missing-block 
> issue until the DataNode runs the DirectoryScanner again.
> As described above, the FileNotFoundException encountered by the reader 
> thread is expected, because the file has simply been moved.
> So we need to add a double check to the invalidateMissingBlock logic that 
> verifies whether the data file or meta file exists, to avoid such cases.
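
A hedged sketch of that double check (the class, method, and directory 
parameters below are illustrative assumptions, not the actual patch):

{code:java}
import java.io.File;

// Sketch only: before a reader thread invalidates a replica after hitting a
// FileNotFoundException, re-check whether the data file or meta file exists in
// either the rbw/ or the finalized/ directory. If either file still exists,
// the block was merely moved by the finalizing writer thread and must not be
// reported to the NameNode as corrupt.
final class InvalidateMissingBlockCheck {
  static boolean shouldInvalidate(File rbwDir, File finalizedDir,
      String blockFileName, String metaFileName) {
    boolean existsSomewhere =
        new File(rbwDir, blockFileName).exists()
        || new File(rbwDir, metaFileName).exists()
        || new File(finalizedDir, blockFileName).exists()
        || new File(finalizedDir, metaFileName).exists();
    // Only remove the replica from the volume map when both files are truly
    // gone from both directories.
    return !existsSomewhere;
  }
}
{code}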



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-17342) Fix DataNode may invalidates normal block causing missing block

2024-01-18 Thread Haiyang Hu (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haiyang Hu updated HDFS-17342:
--
Fix Version/s: 3.5.0

> Fix DataNode may invalidates normal block causing missing block
> ---
>
> Key: HDFS-17342
> URL: https://issues.apache.org/jira/browse/HDFS-17342
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Reporter: Haiyang Hu
>Assignee: Haiyang Hu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.5.0
>
>
> When users read an append file, occasional exceptions may occur, such as 
> org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block: xxx.
> This can happen if one thread is reading the block while the writer thread is 
> finalizing it simultaneously.
> *Root cause:*
> # The reader thread obtains an RBW replica from the VolumeMap, such as 
> blk_xxx_xxx[RBW], whose data file should be in /XXX/rbw/blk_xxx.
> # Simultaneously, the writer thread finalizes this block, moving it from 
> the RBW directory to the FINALIZED directory: the data file moves from 
> /XXX/rbw/blk_xxx to /XXX/finalized/blk_xxx.
> # The reader thread attempts to open the data input stream but encounters a 
> FileNotFoundException, because the data file /XXX/rbw/blk_xxx or meta file 
> /XXX/rbw/blk_xxx_xxx no longer exists at this moment.
> # The reader thread treats this block as corrupt, removes the replica 
> from the volume map, and the DataNode reports the deleted block to the 
> NameNode.
> # The NameNode removes this replica of the block.
> # If the file's replication is 1, this causes a missing-block 
> issue until the DataNode runs the DirectoryScanner again.
> As described above, the FileNotFoundException encountered by the reader 
> thread is expected, because the file has simply been moved.
> So we need to add a double check to the invalidateMissingBlock logic that 
> verifies whether the data file or meta file exists, to avoid such cases.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-17342) Fix DataNode may invalidates normal block causing missing block

2024-01-18 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17808502#comment-17808502
 ] 

ASF GitHub Bot commented on HDFS-17342:
---

hadoop-yetus commented on PR #6464:
URL: https://github.com/apache/hadoop/pull/6464#issuecomment-1899876542

   :broken_heart: **-1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |::|--:|:|::|:---:|
   | +0 :ok: |  reexec  |   0m 21s |  |  Docker mode activated.  |
   |||| _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files found.  |
   | +0 :ok: |  codespell  |   0m  1s |  |  codespell was not available.  |
   | +0 :ok: |  detsecrets  |   0m  1s |  |  detect-secrets was not available.  |
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to include 1 new or modified test files.  |
   |||| _ trunk Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |  32m 18s |  |  trunk passed  |
   | +1 :green_heart: |  compile  |   0m 41s |  |  trunk passed with JDK Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04  |
   | +1 :green_heart: |  compile  |   0m 37s |  |  trunk passed with JDK Private Build-1.8.0_392-8u392-ga-1~20.04-b08  |
   | +1 :green_heart: |  checkstyle  |   1m  9s |  |  trunk passed  |
   | +1 :green_heart: |  mvnsite  |   0m 39s |  |  trunk passed  |
   | +1 :green_heart: |  javadoc  |   0m 39s |  |  trunk passed with JDK Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04  |
   | +1 :green_heart: |  javadoc  |   1m  1s |  |  trunk passed with JDK Private Build-1.8.0_392-8u392-ga-1~20.04-b08  |
   | +1 :green_heart: |  spotbugs  |   1m 47s |  |  trunk passed  |
   | +1 :green_heart: |  shadedclient  |  20m 30s |  |  branch has no errors when building and testing our client artifacts.  |
   |||| _ Patch Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |   0m 35s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   0m 39s |  |  the patch passed with JDK Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04  |
   | +1 :green_heart: |  javac  |   0m 39s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   0m 33s |  |  the patch passed with JDK Private Build-1.8.0_392-8u392-ga-1~20.04-b08  |
   | +1 :green_heart: |  javac  |   0m 33s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks issues.  |
   | +1 :green_heart: |  checkstyle  |   0m 29s |  |  the patch passed  |
   | +1 :green_heart: |  mvnsite  |   0m 35s |  |  the patch passed  |
   | +1 :green_heart: |  javadoc  |   0m 32s |  |  the patch passed with JDK Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04  |
   | +1 :green_heart: |  javadoc  |   0m 55s |  |  the patch passed with JDK Private Build-1.8.0_392-8u392-ga-1~20.04-b08  |
   | +1 :green_heart: |  spotbugs  |   1m 44s |  |  the patch passed  |
   | +1 :green_heart: |  shadedclient  |  20m 35s |  |  patch has no errors when building and testing our client artifacts.  |
   |||| _ Other Tests _ |
   | -1 :x: |  unit  | 196m 24s | [/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6464/4/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt) |  hadoop-hdfs in the patch passed.  |
   | +1 :green_heart: |  asflicense  |   0m 27s |  |  The patch does not generate ASF License warnings.  |
   |  |   | 283m 41s |  |  |


   | Reason | Tests |
   |---:|:--|
   | Failed junit tests | hadoop.hdfs.server.namenode.TestFileTruncate |
   |   | hadoop.hdfs.server.datanode.TestDirectoryScanner |


   | Subsystem | Report/Notes |
   |--:|:-|
   | Docker | ClientAPI=1.43 ServerAPI=1.43 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6464/4/artifact/out/Dockerfile |
   | GITHUB PR | https://github.com/apache/hadoop/pull/6464 |
   | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
   | uname | Linux c7f45b3093ae 5.15.0-88-generic #98-Ubuntu SMP Mon Oct 2 15:18:56 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/bin/hadoop.sh |
   | git revision | trunk / 303ea2116d8b3373a82a310bae480b33aedc15e0 |
   | Default Java | Private Build-1.8.0_392-8u392-ga-1~20.04-b08 |
   | Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_392-8u392-ga-1~20.04-b08 |
   |  Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6464/4/testReport/ |
   | Max. process+thread count | 3961 (vs. ulimit of 5500) |
   | modules | C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs |
   | Console output | 

[jira] [Commented] (HDFS-17293) First packet data + checksum size will be set to 516 bytes when writing to a new block.

2024-01-18 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17808496#comment-17808496
 ] 

ASF GitHub Bot commented on HDFS-17293:
---

hfutatzhanghb commented on code in PR #6368:
URL: https://github.com/apache/hadoop/pull/6368#discussion_r1458465700


##
hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestDFSOutputStream.java:
##
@@ -184,6 +186,40 @@ public void testPreventOverflow() throws IOException, NoSuchFieldException,
     runAdjustChunkBoundary(configuredWritePacketSize, finalWritePacketSize);
   }
 
+  @Test(timeout=6)
+  public void testFirstPacketSizeInNewBlocks() throws IOException {
+    final long blockSize = 1L * 1024 * 1024;
+    final int numDataNodes = 3;
+    final Configuration dfsConf = new Configuration();
+    dfsConf.setLong(DFS_BLOCK_SIZE_KEY, blockSize);
+    MiniDFSCluster dfsCluster = null;
+    dfsCluster = new MiniDFSCluster.Builder(dfsConf).numDataNodes(numDataNodes).build();
+    dfsCluster.waitActive();
+
+    DistributedFileSystem fs = dfsCluster.getFileSystem();
+    Path fileName = new Path("/testfile.dat");
+    FSDataOutputStream fos = fs.create(fileName);
+    DataChecksum crc32c = DataChecksum.newDataChecksum(DataChecksum.Type.CRC32C, 512);
+
+    long loop = 0;
+    Random r = new Random();
+    byte[] buf = new byte[1 * 1024 * 1024];
+    r.nextBytes(buf);
+    fos.write(buf);
+    fos.hflush();
+
+    while (loop < 20) {
+      r.nextBytes(buf);
+      fos.write(buf);
+      fos.hflush();
+      loop++;
+      Assert.assertNotEquals(crc32c.getBytesPerChecksum() + crc32c.getChecksumSize(),

Review Comment:
   Sir, thanks for this valuable suggestion. Will fix it soon.





> First packet data + checksum size will be set to 516 bytes when writing to a 
> new block.
> ---
>
> Key: HDFS-17293
> URL: https://issues.apache.org/jira/browse/HDFS-17293
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Affects Versions: 3.3.6
>Reporter: farmmamba
>Assignee: farmmamba
>Priority: Major
>  Labels: pull-request-available
>
> First packet size will be set to 516 bytes when writing to a new block.
> In the method computePacketChunkSize, the parameters psize and csize would 
> be (0, 512) when writing to a new block. It would be better to use 
> writePacketSize.
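
A simplified reconstruction of the arithmetic behind the 516-byte first 
packet (a standalone illustration, not the exact DFSOutputStream code):

{code:java}
// Standalone illustration of why the first packet of a new block carries a
// single 516-byte chunk when computePacketChunkSize is called with
// (psize, csize) = (0, 512). Constants are illustrative.
public class FirstPacketSize {
  public static void main(String[] args) {
    int csize = 512;                       // data bytes covered by one checksum
    int checksumSize = 4;                  // a CRC32C checksum is 4 bytes
    int chunkSize = csize + checksumSize;  // 516 bytes per chunk
    int psize = 0;                         // packet budget passed for a new block

    // chunksPerPacket is floored at 1, so with psize = 0 the packet holds
    // exactly one chunk: 512 data bytes + 4 checksum bytes = 516 bytes.
    int chunksPerPacket = Math.max(psize / chunkSize, 1);
    System.out.println("first packet data+checksum = "
        + chunksPerPacket * chunkSize + " bytes");
  }
}
{code}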



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-17293) First packet data + checksum size will be set to 516 bytes when writing to a new block.

2024-01-18 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17808488#comment-17808488
 ] 

ASF GitHub Bot commented on HDFS-17293:
---

hfutatzhanghb commented on code in PR #6368:
URL: https://github.com/apache/hadoop/pull/6368#discussion_r1458453315


##
hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestDFSOutputStream.java:
##
@@ -184,6 +186,40 @@ public void testPreventOverflow() throws IOException, NoSuchFieldException,
     runAdjustChunkBoundary(configuredWritePacketSize, finalWritePacketSize);
   }
 
+  @Test(timeout=6)
+  public void testFirstPacketSizeInNewBlocks() throws IOException {
+    final long blockSize = 1L * 1024 * 1024;
+    final int numDataNodes = 3;
+    final Configuration dfsConf = new Configuration();
+    dfsConf.setLong(DFS_BLOCK_SIZE_KEY, blockSize);
+    MiniDFSCluster dfsCluster = null;
+    dfsCluster = new MiniDFSCluster.Builder(dfsConf).numDataNodes(numDataNodes).build();
+    dfsCluster.waitActive();
+
+    DistributedFileSystem fs = dfsCluster.getFileSystem();
+    Path fileName = new Path("/testfile.dat");
+    FSDataOutputStream fos = fs.create(fileName);
+    DataChecksum crc32c = DataChecksum.newDataChecksum(DataChecksum.Type.CRC32C, 512);
+
+    long loop = 0;
+    Random r = new Random();
+    byte[] buf = new byte[1 * 1024 * 1024];

Review Comment:
   Very nice suggestion, thanks a lot sir. I will fix them later.





> First packet data + checksum size will be set to 516 bytes when writing to a 
> new block.
> ---
>
> Key: HDFS-17293
> URL: https://issues.apache.org/jira/browse/HDFS-17293
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Affects Versions: 3.3.6
>Reporter: farmmamba
>Assignee: farmmamba
>Priority: Major
>  Labels: pull-request-available
>
> First packet size will be set to 516 bytes when writing to a new block.
> In the method computePacketChunkSize, the parameters psize and csize would 
> be (0, 512) when writing to a new block. It would be better to use 
> writePacketSize.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-17311) RBF: ConnectionManager creatorQueue should offer a pool that is not already in creatorQueue.

2024-01-18 Thread farmmamba (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17808478#comment-17808478
 ] 

farmmamba commented on HDFS-17311:
--

You can use "git commit --allow-empty".


张浩博
hfutzhan...@163.com


 Replied Message 

[ 
https://issues.apache.org/jira/browse/HDFS-17311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17808474#comment-17808474
 ]

ASF GitHub Bot commented on HDFS-17311:
---

LiuGuH commented on PR #6392:
URL: https://github.com/apache/hadoop/pull/6392#issuecomment-1899821409

> @LiuGuH Thanks for the contribution! Can we trigger compilation again?

Thanks for the review. The compilation is now triggered.
I triggered it with the command "git commit --amend && git push -f".
Is there any other way to trigger the compilation? Thanks




RBF: ConnectionManager creatorQueue should offer a pool that is not already in 
creatorQueue.


Key: HDFS-17311
URL: https://issues.apache.org/jira/browse/HDFS-17311
Project: Hadoop HDFS
Issue Type: Improvement
Reporter: liuguanghua
Assignee: liuguanghua
Priority: Major
Labels: pull-request-available

In the Router, we found the below log:
 
2023-12-29 15:18:54,799 ERROR 
org.apache.hadoop.hdfs.server.federation.router.ConnectionManager: Cannot add 
more than 2048 connections at the same time
 
The log indicates that ConnectionManager.creatorQueue is full at a certain 
point, but my cluster does not have enough users to reach 2048 pairs of 
.
This may be due to the following reasons:
# ConnectionManager.creatorQueue is a queue that is offered a ConnectionPool 
when there are not enough ConnectionContexts.
# The ConnectionCreator thread consumes from creatorQueue and creates more 
ConnectionContexts for a ConnectionPool.
# Clients may concurrently invoke ConnectionManager.getConnection() for the 
same user, which can add many duplicates of the same ConnectionPool to 
ConnectionManager.creatorQueue.
# When creatorQueue is full, a new ConnectionPool cannot be added and this 
error is logged. As a result, a genuinely new ConnectionPool may be unable to 
produce more ConnectionContexts for a new user.
So this PR tries to ensure that creatorQueue is not offered the same 
ConnectionPool more than once.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org


> RBF: ConnectionManager creatorQueue should offer a pool that is not already 
> in creatorQueue.
> 
>
> Key: HDFS-17311
> URL: https://issues.apache.org/jira/browse/HDFS-17311
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: liuguanghua
>Assignee: liuguanghua
>Priority: Major
>  Labels: pull-request-available
>
> In the Router, we found the below log:
>  
> 2023-12-29 15:18:54,799 ERROR 
> org.apache.hadoop.hdfs.server.federation.router.ConnectionManager: Cannot add 
> more than 2048 connections at the same time
>  
> The log indicates that ConnectionManager.creatorQueue is full at a certain 
> point, but my cluster does not have enough users to reach 2048 pairs of 
> .
> This may be due to the following reasons:
>  # ConnectionManager.creatorQueue is a queue that is offered a ConnectionPool 
> when there are not enough ConnectionContexts.
>  # The ConnectionCreator thread consumes from creatorQueue and creates more 
> ConnectionContexts for a ConnectionPool.
>  # Clients may concurrently invoke ConnectionManager.getConnection() for the 
> same user, which can add many duplicates of the same ConnectionPool to 
> ConnectionManager.creatorQueue.
>  # When creatorQueue is full, a new ConnectionPool cannot be added and this 
> error is logged. As a result, a genuinely new ConnectionPool may be unable to 
> produce more ConnectionContexts for a new user.
> So this PR tries to ensure that creatorQueue is not offered the same 
> ConnectionPool more than once.
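
A hedged sketch of the de-duplication idea (the class and method names below 
are assumptions for illustration, not the actual ConnectionManager change):

{code:java}
import java.util.Set;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;

// Sketch only: track which pools are already waiting in creatorQueue so that
// concurrent getConnection() calls for the same user cannot enqueue duplicates
// of the same ConnectionPool and exhaust the queue's capacity.
final class DedupCreatorQueue<P> {
  private final BlockingQueue<P> creatorQueue = new ArrayBlockingQueue<>(2048);
  private final Set<P> queued = ConcurrentHashMap.newKeySet();

  /** Offer the pool only if it is not already waiting in the queue. */
  boolean offerIfAbsent(P pool) {
    if (!queued.add(pool)) {
      return false;              // already queued, nothing to do
    }
    if (!creatorQueue.offer(pool)) {
      queued.remove(pool);       // queue genuinely full: roll back the marker
      return false;
    }
    return true;
  }

  /** Consumer side: take a pool and clear its marker before processing it. */
  P take() throws InterruptedException {
    P pool = creatorQueue.take();
    queued.remove(pool);
    return pool;
  }
}
{code}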



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-17311) RBF: ConnectionManager creatorQueue should offer a pool that is not already in creatorQueue.

2024-01-18 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17808474#comment-17808474
 ] 

ASF GitHub Bot commented on HDFS-17311:
---

LiuGuH commented on PR #6392:
URL: https://github.com/apache/hadoop/pull/6392#issuecomment-1899821409

   > @LiuGuH Thanks for the contribution! Can we trigger compilation again?
   
   Thanks for the review. The compilation is now triggered.
   I triggered it with the command "git commit --amend && git push -f".
   Is there any other way to trigger the compilation? Thanks




> RBF: ConnectionManager creatorQueue should offer a pool that is not already 
> in creatorQueue.
> 
>
> Key: HDFS-17311
> URL: https://issues.apache.org/jira/browse/HDFS-17311
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: liuguanghua
>Assignee: liuguanghua
>Priority: Major
>  Labels: pull-request-available
>
> In the Router, we found the below log:
>  
> 2023-12-29 15:18:54,799 ERROR 
> org.apache.hadoop.hdfs.server.federation.router.ConnectionManager: Cannot add 
> more than 2048 connections at the same time
>  
> The log indicates that ConnectionManager.creatorQueue is full at a certain 
> point, but my cluster does not have enough users to reach 2048 pairs of 
> .
> This may be due to the following reasons:
>  # ConnectionManager.creatorQueue is a queue that is offered a ConnectionPool 
> when there are not enough ConnectionContexts.
>  # The ConnectionCreator thread consumes from creatorQueue and creates more 
> ConnectionContexts for a ConnectionPool.
>  # Clients may concurrently invoke ConnectionManager.getConnection() for the 
> same user, which can add many duplicates of the same ConnectionPool to 
> ConnectionManager.creatorQueue.
>  # When creatorQueue is full, a new ConnectionPool cannot be added and this 
> error is logged. As a result, a genuinely new ConnectionPool may be unable to 
> produce more ConnectionContexts for a new user.
> So this PR tries to ensure that creatorQueue is not offered the same 
> ConnectionPool more than once.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-17311) RBF: ConnectionManager creatorQueue should offer a pool that is not already in creatorQueue.

2024-01-18 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17808467#comment-17808467
 ] 

ASF GitHub Bot commented on HDFS-17311:
---

slfan1989 commented on PR #6392:
URL: https://github.com/apache/hadoop/pull/6392#issuecomment-1899809792

   > LGTM @slfan1989 any further comments?
   
   @goiri Thanks for reviewing the code! LGTM +1.




> RBF: ConnectionManager creatorQueue should offer a pool that is not already 
> in creatorQueue.
> 
>
> Key: HDFS-17311
> URL: https://issues.apache.org/jira/browse/HDFS-17311
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: liuguanghua
>Assignee: liuguanghua
>Priority: Major
>  Labels: pull-request-available
>
> In the Router, we found the below log:
>  
> 2023-12-29 15:18:54,799 ERROR 
> org.apache.hadoop.hdfs.server.federation.router.ConnectionManager: Cannot add 
> more than 2048 connections at the same time
>  
> The log indicates that ConnectionManager.creatorQueue is full at a certain 
> point, but my cluster does not have enough users to reach 2048 pairs of 
> .
> This may be due to the following reasons:
>  # ConnectionManager.creatorQueue is a queue that is offered a ConnectionPool 
> when there are not enough ConnectionContexts.
>  # The ConnectionCreator thread consumes from creatorQueue and creates more 
> ConnectionContexts for a ConnectionPool.
>  # Clients may concurrently invoke ConnectionManager.getConnection() for the 
> same user, which can add many duplicates of the same ConnectionPool to 
> ConnectionManager.creatorQueue.
>  # When creatorQueue is full, a new ConnectionPool cannot be added and this 
> error is logged. As a result, a genuinely new ConnectionPool may be unable to 
> produce more ConnectionContexts for a new user.
> So this PR tries to ensure that creatorQueue is not offered the same 
> ConnectionPool more than once.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-17311) RBF: ConnectionManager creatorQueue should offer a pool that is not already in creatorQueue.

2024-01-18 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17808466#comment-17808466
 ] 

ASF GitHub Bot commented on HDFS-17311:
---

slfan1989 commented on PR #6392:
URL: https://github.com/apache/hadoop/pull/6392#issuecomment-1899809388

   @LiuGuH Thanks for the contribution! Can we trigger compilation again?




> RBF: ConnectionManager creatorQueue should offer a pool that is not already 
> in creatorQueue.
> 
>
> Key: HDFS-17311
> URL: https://issues.apache.org/jira/browse/HDFS-17311
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: liuguanghua
>Assignee: liuguanghua
>Priority: Major
>  Labels: pull-request-available
>
> In the Router, we found the below log:
>  
> 2023-12-29 15:18:54,799 ERROR 
> org.apache.hadoop.hdfs.server.federation.router.ConnectionManager: Cannot add 
> more than 2048 connections at the same time
>  
> The log indicates that ConnectionManager.creatorQueue is full at a certain 
> point, but my cluster does not have enough users to reach 2048 pairs of 
> .
> This may be due to the following reasons:
>  # ConnectionManager.creatorQueue is a queue that is offered a ConnectionPool 
> when there are not enough ConnectionContexts.
>  # The ConnectionCreator thread consumes from creatorQueue and creates more 
> ConnectionContexts for a ConnectionPool.
>  # Clients may concurrently invoke ConnectionManager.getConnection() for the 
> same user, which can add many duplicates of the same ConnectionPool to 
> ConnectionManager.creatorQueue.
>  # When creatorQueue is full, a new ConnectionPool cannot be added and this 
> error is logged. As a result, a genuinely new ConnectionPool may be unable to 
> produce more ConnectionContexts for a new user.
> So this PR tries to ensure that creatorQueue is not offered the same 
> ConnectionPool more than once.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-17332) DFSInputStream: avoid logging stacktrace until when we really need to fail a read request with a MissingBlockException

2024-01-18 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17808454#comment-17808454
 ] 

ASF GitHub Bot commented on HDFS-17332:
---

xinglin commented on PR #6446:
URL: https://github.com/apache/hadoop/pull/6446#issuecomment-1899717945

   Thanks @ctrezzo, @li-leyang and @mccormickt12 for reviewing




> DFSInputStream: avoid logging stacktrace until when we really need to fail a 
> read request with a MissingBlockException
> --
>
> Key: HDFS-17332
> URL: https://issues.apache.org/jira/browse/HDFS-17332
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs
>Reporter: Xing Lin
>Assignee: Xing Lin
>Priority: Minor
>  Labels: pull-request-available
>
> In DFSInputStream#actualGetFromOneDataNode(), we send the exception 
> stacktrace to dfsClient.LOG whenever we fail on a DN. However, in most 
> cases the read request will be served successfully by reading from the next 
> available DN. The presence of the exception stacktrace in the log has caused 
> multiple Hadoop users at LinkedIn to consider this WARN message the 
> root-cause/fatal error for their jobs. We would like to improve the log 
> message and avoid sending the stacktrace to dfsClient.LOG when a read 
> succeeds. The stacktrace for reads from each DN is sent to the log only when 
> we really need to fail a read request (when 
> chooseDataNode()/refetchLocations() throws a BlockMissingException).
>  
> Example stack trace
> {code:java}
> [12]:23/11/30 23:01:33 WARN hdfs.DFSClient: Connection failure: 
> Failed to connect to 10.150.91.13/10.150.91.13:71 for file 
> //part--95b9909c-zzz-c000.avro for block 
> BP-364971551-DatanodeIP-1448516588954:blk__129864739321:java.net.SocketTimeoutException:
>  6 millis timeout while waiting for channel to be ready for read. ch : 
> java.nio.channels.SocketChannel[connected local=/ip:40492 
> remote=datanodeIP:71] [12]:java.net.SocketTimeoutException: 6 
> millis timeout while waiting for channel to be ready for read. ch : 
> java.nio.channels.SocketChannel[connected local=/localIp:40492 
> remote=datanodeIP:71] [12]: at 
> org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164) 
> [12]: at 
> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161) 
> [12]: at 
> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131) 
> [12]: at 
> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:118) 
> [12]: at java.io.FilterInputStream.read(FilterInputStream.java:83) 
> [12]: at 
> org.apache.hadoop.hdfs.protocolPB.PBHelperClient.vintPrefixed(PBHelperClient.java:458)
>  [12]: at 
> org.apache.hadoop.hdfs.client.impl.BlockReaderRemote2.newBlockReader(BlockReaderRemote2.java:412)
>  [12]: at 
> org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.getRemoteBlockReader(BlockReaderFactory.java:864)
>  [12]: at 
> org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:753)
>  [12]: at 
> org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.build(BlockReaderFactory.java:387)
>  [12]: at 
> org.apache.hadoop.hdfs.DFSInputStream.getBlockReader(DFSInputStream.java:736) 
> [12]: at 
> org.apache.hadoop.hdfs.DFSInputStream.actualGetFromOneDataNode(DFSInputStream.java:1268)
>  [12]: at 
> org.apache.hadoop.hdfs.DFSInputStream.fetchBlockByteRange(DFSInputStream.java:1216)
>  [12]: at 
> org.apache.hadoop.hdfs.DFSInputStream.pread(DFSInputStream.java:1608) 
> [12]: at 
> org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:1568) 
> [12]: at 
> org.apache.hadoop.fs.FSDataInputStream.read(FSDataInputStream.java:93) 
> [12]: at 
> hdfs_metrics_shade.org.apache.hadoop.fs.InstrumentedFSDataInputStream$InstrumentedFilterInputStream.lambda$read$0(InstrumentedFSDataInputStream.java:108)
>  [12]: at 
> com.linkedin.hadoop.metrics.fs.PerformanceTrackingFSDataInputStream.process(PerformanceTrackingFSDataInputStream.java:39)
>  [12]: at 
> hdfs_metrics_shade.org.apache.hadoop.fs.InstrumentedFSDataInputStream$InstrumentedFilterInputStream.read(InstrumentedFSDataInputStream.java:108)
>  [12]: at 
> org.apache.hadoop.fs.FSDataInputStream.read(FSDataInputStream.java:93) 
> [12]: at 
> org.apache.hadoop.fs.RetryingInputStream.lambda$read$2(RetryingInputStream.java:153)
>  [12]: at 
> org.apache.hadoop.fs.NoOpRetryPolicy.run(NoOpRetryPolicy.java:36) 
> [12]: at 
> org.apache.hadoop.fs.RetryingInputStream.read(RetryingInputStream.java:149) 
> [12]: at 
> org.apache.hadoop.fs.FSDataInputStream.read(FSDataInputStream.java:93){code}
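
A hedged sketch of the proposed logging pattern (the class and helper 
interface below are assumptions for illustration, not the actual 
DFSInputStream change):

{code:java}
import java.io.IOException;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Sketch only: on a per-DataNode failure, log a one-line WARN without the
// stacktrace; attach the full exception only when every replica has failed
// and the read must be failed with a missing-block error.
final class QuietFailoverRead {
  private static final Logger LOG = LoggerFactory.getLogger(QuietFailoverRead.class);

  interface DataNodeRead {
    byte[] read(String datanode) throws IOException;
  }

  static byte[] readWithFailover(String[] datanodes, DataNodeRead reader)
      throws IOException {
    IOException last = null;
    for (String dn : datanodes) {
      try {
        return reader.read(dn);  // success: no stacktrace ever reaches the log
      } catch (IOException e) {
        last = e;
        // Message only, no stacktrace: the next DN will likely serve the read.
        LOG.warn("Failed to read from {}, trying next datanode: {}", dn, e.toString());
      }
    }
    // Every replica failed: now the stacktrace is genuinely the root cause.
    LOG.error("Could not obtain block from any datanode", last);
    throw last != null ? last : new IOException("no datanodes to read from");
  }
}
{code}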



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HDFS-17302) RBF: ProportionRouterRpcFairnessPolicyController-Sharing and isolation.

2024-01-18 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17808453#comment-17808453
 ] 

ASF GitHub Bot commented on HDFS-17302:
---

KeeProMise commented on PR #6380:
URL: https://github.com/apache/hadoop/pull/6380#issuecomment-1899693424

   > > @huangzhaobo99 do you still have concerns with the approach?
   > 
   > @goiri No worries anymore, I think the sharing mechanism is really good, 
and percentage based allocation is easier to use. cc @KeeProMise
   
   @goiri @huangzhaobo99 Thanks for your review. If no more comments here, 
please help merge it, thanks! @goiri 




> RBF: ProportionRouterRpcFairnessPolicyController-Sharing and isolation.
> ---
>
> Key: HDFS-17302
> URL: https://issues.apache.org/jira/browse/HDFS-17302
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: rbf
>Reporter: Jian Zhang
>Assignee: Jian Zhang
>Priority: Major
>  Labels: pull-request-available
> Attachments: HDFS-17302.001.patch, HDFS-17302.002.patch, 
> HDFS-17302.003.patch
>
>
> h2. Current shortcomings
> [HDFS-14090|https://issues.apache.org/jira/browse/HDFS-14090] provides a 
> StaticRouterRpcFairnessPolicyController to support configuring different 
> handlers for different ns. Using the StaticRouterRpcFairnessPolicyController 
> allows the router to isolate different ns, and the ns with a higher load will 
> not affect the router's access to the ns with a normal load. But the 
> StaticRouterRpcFairnessPolicyController still falls short in many ways, such 
> as:
> 1. *Configuration is inconvenient and error-prone*: When I use 
> StaticRouterRpcFairnessPolicyController, I first need to know how many 
> handlers the router has in total, then I have to know how many nameservices 
> the router currently has, and then carefully calculate how many handlers to 
> allocate to each ns so that the sum of handlers for all ns will not exceed 
> the total handlers of the router, and I also need to consider how many 
> handlers to allocate to each ns to achieve better performance. Therefore, I 
> need to be very careful when configuring. Even if I configure only one more 
> handler for a certain ns, the total number is more than the number of 
> handlers owned by the router, which will also cause the router to fail to 
> start. At this time, I had to investigate the reason why the router failed to 
> start. After finding the reason, I had to reconsider the number of handlers 
> for each ns. In addition, when I reconfigure the total number of handlers on 
> the router, I have to re-allocate handlers to each ns, which undoubtedly 
> increases the complexity of operation and maintenance.
> 2. *Extension ns is not supported*: During the running of the router, if a 
> new ns is added to the cluster and a mount is added for the ns, but because 
> no handler is allocated for the ns, the ns cannot be accessed through the 
> router. We must reconfigure the number of handlers and then refresh the 
> configuration. At this time, the router can access the ns normally. When we 
> reconfigure the number of handlers, we have to face disadvantage 1: 
> Configuration is inconvenient and error-prone.
> 3. *Waste handlers*:  The main purpose of proposing 
> RouterRpcFairnessPolicyController is to enable the router to access ns with 
> normal load and not be affected by ns with higher load. First of all, not all 
> ns have high loads; secondly, ns with high loads do not have high loads 24 
> hours a day. It may be that only certain time periods, such as 0 to 8 
> o'clock, have high loads, and other time periods have normal loads. Assume 
> there are 2 ns, and each ns is allocated half of the number of handlers. 
> Assume that ns1 has many requests from 0 to 14 o'clock, and almost no 
> requests from 14 to 24 o'clock, ns2 has many requests from 12 to 24 o'clock, 
> and almost no requests from 0 to 14 o'clock; when it is between 0 o'clock and 
> 12 o'clock and between 14 o'clock and 24 o'clock, only one ns has more 
> requests and the other ns has almost no requests, so we have wasted half of 
> the number of handlers.
> 4. *Only isolation, no sharing*: The StaticRouterRpcFairnessPolicyController 
> does not support sharing, only isolation. I think isolation is just a means 
> to improve the performance of router access to normal ns, not the purpose. It 
> is impossible for all ns in the cluster to have high loads. On the contrary, 
> in most scenarios, only a few ns in the cluster have high loads, and the 
> loads of most other ns are normal. For ns with higher load and ns with normal 
> load, we need to isolate their handlers so that the ns with higher load will 
> not affect the performance of ns with lower load. However, for nameservices 
> that are 

[jira] [Commented] (HDFS-17302) RBF: ProportionRouterRpcFairnessPolicyController-Sharing and isolation.

2024-01-18 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17808452#comment-17808452
 ] 

ASF GitHub Bot commented on HDFS-17302:
---

KeeProMise commented on PR #6380:
URL: https://github.com/apache/hadoop/pull/6380#issuecomment-1899691942

   > > @huangzhaobo99 do you still have concerns with the approach?
   > 
   > @goiri No worries anymore, I think the sharing mechanism is really good, 
and percentage based allocation is easier to use. cc @KeeProMise
   
   @goiri @huangzhaobo99 Thanks for your review. If no more comments here, 
please help merge it, thanks! @goiri 




> RBF: ProportionRouterRpcFairnessPolicyController-Sharing and isolation.
> ---
>
> Key: HDFS-17302
> URL: https://issues.apache.org/jira/browse/HDFS-17302
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: rbf
>Reporter: Jian Zhang
>Assignee: Jian Zhang
>Priority: Major
>  Labels: pull-request-available
> Attachments: HDFS-17302.001.patch, HDFS-17302.002.patch, 
> HDFS-17302.003.patch
>
>
> h2. Current shortcomings
> [HDFS-14090|https://issues.apache.org/jira/browse/HDFS-14090] provides a 
> StaticRouterRpcFairnessPolicyController to support configuring different 
> handlers for different ns. Using the StaticRouterRpcFairnessPolicyController 
> allows the router to isolate different ns, and the ns with a higher load will 
> not affect the router's access to the ns with a normal load. But the 
> StaticRouterRpcFairnessPolicyController still falls short in many ways, such 
> as:
> 1. *Configuration is inconvenient and error-prone*: When I use 
> StaticRouterRpcFairnessPolicyController, I first need to know how many 
> handlers the router has in total, then I have to know how many nameservices 
> the router currently has, and then carefully calculate how many handlers to 
> allocate to each ns so that the sum of handlers for all ns will not exceed 
> the total handlers of the router, and I also need to consider how many 
> handlers to allocate to each ns to achieve better performance. Therefore, I 
> need to be very careful when configuring. Even if I configure only one more 
> handler for a certain ns, the total number is more than the number of 
> handlers owned by the router, which will also cause the router to fail to 
> start. At this time, I had to investigate the reason why the router failed to 
> start. After finding the reason, I had to reconsider the number of handlers 
> for each ns. In addition, when I reconfigure the total number of handlers on 
> the router, I have to re-allocate handlers to each ns, which undoubtedly 
> increases the complexity of operation and maintenance.
> 2. *Extension ns is not supported*: During the running of the router, if a 
> new ns is added to the cluster and a mount is added for the ns, but because 
> no handler is allocated for the ns, the ns cannot be accessed through the 
> router. We must reconfigure the number of handlers and then refresh the 
> configuration. At this time, the router can access the ns normally. When we 
> reconfigure the number of handlers, we have to face disadvantage 1: 
> Configuration is inconvenient and error-prone.
> 3. *Waste handlers*:  The main purpose of proposing 
> RouterRpcFairnessPolicyController is to enable the router to access ns with 
> normal load and not be affected by ns with higher load. First of all, not all 
> ns have high loads; secondly, ns with high loads do not have high loads 24 
> hours a day. It may be that only certain time periods, such as 0 to 8 
> o'clock, have high loads, and other time periods have normal loads. Assume 
> there are 2 ns, and each ns is allocated half of the number of handlers. 
> Assume that ns1 has many requests from 0 to 14 o'clock, and almost no 
> requests from 14 to 24 o'clock, ns2 has many requests from 12 to 24 o'clock, 
> and almost no requests from 0 to 14 o'clock; when it is between 0 o'clock and 
> 12 o'clock and between 14 o'clock and 24 o'clock, only one ns has more 
> requests and the other ns has almost no requests, so we have wasted half of 
> the number of handlers.
> 4. *Only isolation, no sharing*: The StaticRouterRpcFairnessPolicyController 
> does not support sharing, only isolation. I think isolation is just a means 
> to improve the performance of router access to normal ns, not the purpose. It 
> is impossible for all ns in the cluster to have high loads. On the contrary, 
> in most scenarios, only a few ns in the cluster have high loads, and the 
> loads of most other ns are normal. For ns with higher load and ns with normal 
> load, we need to isolate their handlers so that the ns with higher load will 
> not affect the performance of ns with lower load. However, for nameservices 
> that are 

[jira] [Commented] (HDFS-17293) First packet data + checksum size will be set to 516 bytes when writing to a new block.

2024-01-18 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17808450#comment-17808450
 ] 

ASF GitHub Bot commented on HDFS-17293:
---

zhangshuyan0 commented on PR #6368:
URL: https://github.com/apache/hadoop/pull/6368#issuecomment-1899635293

   This PR has corrected the size of the first packet in a new block, which is 
great. However, due to the original logical problem in `adjustChunkBoundary`, 
the calculation of the size of the last packet in a block is still problematic, 
and I think we need a new PR to solve it.
   
https://github.com/apache/hadoop/blob/27ecc23ae7c5cafba6a5ea58d4a68d25bd7507dd/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DFSOutputStream.java#L531-L543
   Line 540: when we pass `blockSize - getStreamer().getBytesCurBlock()` to 
`computePacketChunkSize` as the first parameter, `computePacketChunkSize` is 
likely to split the data that could have been sent in one packet into two 
packets for sending.




> First packet data + checksum size will be set to 516 bytes when writing to a 
> new block.
> ---
>
> Key: HDFS-17293
> URL: https://issues.apache.org/jira/browse/HDFS-17293
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Affects Versions: 3.3.6
>Reporter: farmmamba
>Assignee: farmmamba
>Priority: Major
>  Labels: pull-request-available
>
> First packet size will be set to 516 bytes when writing to a new block.
> In the method computePacketChunkSize, the parameters psize and csize would 
> be (0, 512) when writing to a new block. It would be better to use 
> writePacketSize.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-17313) dfsadmin -reconfig option to start/query reconfig on all live namenodes.

2024-01-18 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17808449#comment-17808449
 ] 

ASF GitHub Bot commented on HDFS-17313:
---

huangzhaobo99 commented on PR #6395:
URL: https://github.com/apache/hadoop/pull/6395#issuecomment-1899632674

   @goiri If you have time, can you also help me review this? Previously 
there was a batch refresh for the DNs, but the relevant reviewers have not 
replied to me. This update adds the batch refresh mechanism for the NNs. Thanks.




> dfsadmin -reconfig option to start/query reconfig on all live namenodes.
> 
>
> Key: HDFS-17313
> URL: https://issues.apache.org/jira/browse/HDFS-17313
> Project: Hadoop HDFS
>  Issue Type: New Feature
>Reporter: huangzhaobo
>Assignee: huangzhaobo
>Priority: Major
>  Labels: pull-request-available
>
> https://issues.apache.org/jira/browse/HDFS-16568 added support for batch 
> refreshing of datanode configurations.
> There are several NNs in an HA or federated cluster, and this ticket 
> implements batch refreshing of NN configurations.
> *Implementation method*
>  # Use the DFSUtil.getNNServiceRpcAddressesForCluster method to parse the 
> configuration and obtain the addresses of all NNs.
>  # Use two worker threads; configuring the number of worker threads is not 
> currently supported (it will be implemented in another ticket if necessary).
> *Sample outputs*
> {code:java}
> $ bin/hdfs dfsadmin -reconfig namenode livenodes start
> Started reconfiguration task on node [localhost:50034].
> Started reconfiguration task on node [localhost:50036].
> Started reconfiguration task on node [localhost:50038].
> Started reconfiguration task on node [localhost:50040]. 
> Starting of reconfiguration task successful on 4 nodes, failed on 0 nodes.
> $ bin/hdfs dfsadmin -reconfig namenode livenodes status
> Reconfiguring status for node [localhost:50034]
> SUCCESS: Changed property dfs.heartbeat.interval
>   From: "5"
>   To: "3"
> Reconfiguring status for node [localhost:50036]
> SUCCESS: Changed property dfs.heartbeat.interval
>   From: "5"
>   To: "3"
> Reconfiguring status for node [localhost:50038]
> SUCCESS: Changed property dfs.heartbeat.interval
>   From: "5"
>   To: "3"
> Reconfiguring status for node [localhost:50040]
> SUCCESS: Changed property dfs.heartbeat.interval
>   From: "5"
>   To: "3"
> Retrieval of reconfiguration status successful on 4 nodes, failed on 0 
> nodes.{code}
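
A hedged sketch of the dispatch described above (the method and the per-node 
call are illustrative assumptions, not the actual DFSAdmin code):

{code:java}
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch only: fan the "start reconfiguration" call out to every resolved
// NameNode address from a fixed pool of two worker threads, then summarize.
final class BatchNnReconfig {
  static void startOnAllNamenodes(List<String> nnAddresses) throws InterruptedException {
    ExecutorService workers = Executors.newFixedThreadPool(2); // 2 workers, as in the ticket
    AtomicInteger failed = new AtomicInteger();
    for (String addr : nnAddresses) {
      workers.execute(() -> {
        try {
          // Placeholder for the per-node RPC that starts the reconfig task.
          System.out.println("Started reconfiguration task on node [" + addr + "].");
        } catch (RuntimeException e) {
          failed.incrementAndGet();
        }
      });
    }
    workers.shutdown();
    workers.awaitTermination(5, TimeUnit.MINUTES);
    System.out.println("Starting of reconfiguration task successful on "
        + (nnAddresses.size() - failed.get()) + " nodes, failed on "
        + failed.get() + " nodes.");
  }
}
{code}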



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-17302) RBF: ProportionRouterRpcFairnessPolicyController-Sharing and isolation.

2024-01-18 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17808446#comment-17808446
 ] 

ASF GitHub Bot commented on HDFS-17302:
---

huangzhaobo99 commented on PR #6380:
URL: https://github.com/apache/hadoop/pull/6380#issuecomment-1899621973

   > @huangzhaobo99 do you still have concerns with the approach?
   
   @goiri  No worries anymore, I think the sharing mechanism is really good, 
and percentage based allocation is easier to use. cc @KeeProMise 




> RBF: ProportionRouterRpcFairnessPolicyController-Sharing and isolation.
> ---
>
> Key: HDFS-17302
> URL: https://issues.apache.org/jira/browse/HDFS-17302
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: rbf
>Reporter: Jian Zhang
>Assignee: Jian Zhang
>Priority: Major
>  Labels: pull-request-available
> Attachments: HDFS-17302.001.patch, HDFS-17302.002.patch, 
> HDFS-17302.003.patch
>
>
> h2. Current shortcomings
> [HDFS-14090|https://issues.apache.org/jira/browse/HDFS-14090] provides a 
> StaticRouterRpcFairnessPolicyController to support configuring different 
> handlers for different ns. Using the StaticRouterRpcFairnessPolicyController 
> allows the router to isolate different ns, and the ns with a higher load will 
> not affect the router's access to the ns with a normal load. But the 
> StaticRouterRpcFairnessPolicyController still falls short in many ways, such 
> as:
> 1. *Configuration is inconvenient and error-prone*: When I use 
> StaticRouterRpcFairnessPolicyController, I first need to know how many 
> handlers the router has in total, then I have to know how many nameservices 
> the router currently has, and then carefully calculate how many handlers to 
> allocate to each ns so that the sum of handlers for all ns will not exceed 
> the total handlers of the router, and I also need to consider how many 
> handlers to allocate to each ns to achieve better performance. Therefore, I 
> need to be very careful when configuring. Even if I configure only one more 
> handler for a certain ns, the total number is more than the number of 
> handlers owned by the router, which will also cause the router to fail to 
> start. At this time, I had to investigate the reason why the router failed to 
> start. After finding the reason, I had to reconsider the number of handlers 
> for each ns. In addition, when I reconfigure the total number of handlers on 
> the router, I have to re-allocate handlers to each ns, which undoubtedly 
> increases the complexity of operation and maintenance.
> 2. *Extension ns is not supported*: During the running of the router, if a 
> new ns is added to the cluster and a mount is added for the ns, but because 
> no handler is allocated for the ns, the ns cannot be accessed through the 
> router. We must reconfigure the number of handlers and then refresh the 
> configuration. At this time, the router can access the ns normally. When we 
> reconfigure the number of handlers, we have to face disadvantage 1: 
> Configuration is inconvenient and error-prone.
> 3. *Waste handlers*:  The main purpose of proposing 
> RouterRpcFairnessPolicyController is to enable the router to access ns with 
> normal load and not be affected by ns with higher load. First of all, not all 
> ns have high loads; secondly, ns with high loads do not have high loads 24 
> hours a day. It may be that only certain time periods, such as 0 to 8 
> o'clock, have high loads, and other time periods have normal loads. Assume 
> there are 2 ns, and each ns is allocated half of the number of handlers. 
> Assume that ns1 has many requests from 0 to 14 o'clock, and almost no 
> requests from 14 to 24 o'clock, ns2 has many requests from 12 to 24 o'clock, 
> and almost no requests from 0 to 14 o'clock; when it is between 0 o'clock and 
> 12 o'clock and between 14 o'clock and 24 o'clock, only one ns has more 
> requests and the other ns has almost no requests, so we have wasted half of 
> the number of handlers.
> 4. *Only isolation, no sharing*: The StaticRouterRpcFairnessPolicyController 
> does not support sharing, only isolation. I think isolation is just a means 
> to improve the performance of router access to normal ns, not the purpose. It 
> is impossible for all ns in the cluster to have high loads. On the contrary, 
> in most scenarios, only a few ns in the cluster have high loads, and the 
> loads of most other ns are normal. For ns with higher load and ns with normal 
> load, we need to isolate their handlers so that the ns with higher load will 
> not affect the performance of ns with lower load. However, for nameservices 
> that are also under normal load, or are under higher load, we do not need to 
> isolate them, these ns of the same nature can share 

[jira] [Commented] (HDFS-17293) First packet data + checksum size will be set to 516 bytes when writing to a new block.

2024-01-18 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17808444#comment-17808444
 ] 

ASF GitHub Bot commented on HDFS-17293:
---

zhangshuyan0 commented on code in PR #6368:
URL: https://github.com/apache/hadoop/pull/6368#discussion_r1458246249


##
hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestDFSOutputStream.java:
##
@@ -184,6 +186,40 @@ public void testPreventOverflow() throws IOException, NoSuchFieldException,
     runAdjustChunkBoundary(configuredWritePacketSize, finalWritePacketSize);
   }
 
+  @Test(timeout=6)
+  public void testFirstPacketSizeInNewBlocks() throws IOException {
+    final long blockSize = 1L * 1024 * 1024;
+    final int numDataNodes = 3;
+    final Configuration dfsConf = new Configuration();
+    dfsConf.setLong(DFS_BLOCK_SIZE_KEY, blockSize);
+    MiniDFSCluster dfsCluster = null;
+    dfsCluster = new MiniDFSCluster.Builder(dfsConf).numDataNodes(numDataNodes).build();
+    dfsCluster.waitActive();
+
+    DistributedFileSystem fs = dfsCluster.getFileSystem();
+    Path fileName = new Path("/testfile.dat");
+    FSDataOutputStream fos = fs.create(fileName);
+    DataChecksum crc32c = DataChecksum.newDataChecksum(DataChecksum.Type.CRC32C, 512);
+
+    long loop = 0;
+    Random r = new Random();
+    byte[] buf = new byte[1 * 1024 * 1024];
+    r.nextBytes(buf);
+    fos.write(buf);
+    fos.hflush();
+
+    while (loop < 20) {
+      r.nextBytes(buf);
+      fos.write(buf);
+      fos.hflush();
+      loop++;
+      Assert.assertNotEquals(crc32c.getBytesPerChecksum() + crc32c.getChecksumSize(),

Review Comment:
   It is more appropriate to precisely specify the expected `packetSize` here.
   Outside the `while loop`:
   ```
   int chunkSize = crc32c.getBytesPerChecksum() + crc32c.getChecksumSize();
   int packetContentSize = (dfsConf.getInt(DFS_CLIENT_WRITE_PACKET_SIZE_KEY, DFS_CLIENT_WRITE_PACKET_SIZE_DEFAULT) - PacketHeader.PKT_MAX_HEADER_LEN) / chunkSize * chunkSize;
   ```
   And here:
   ```
   Assert.assertEquals(((DFSOutputStream) fos.getWrappedStream()).packetSize, packetContentSize);
   ```



##
hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestDFSOutputStream.java:
##
@@ -184,6 +186,40 @@ public void testPreventOverflow() throws IOException, NoSuchFieldException,
     runAdjustChunkBoundary(configuredWritePacketSize, finalWritePacketSize);
   }
 
+  @Test(timeout=6)
+  public void testFirstPacketSizeInNewBlocks() throws IOException {
+    final long blockSize = 1L * 1024 * 1024;
+    final int numDataNodes = 3;
+    final Configuration dfsConf = new Configuration();
+    dfsConf.setLong(DFS_BLOCK_SIZE_KEY, blockSize);
+    MiniDFSCluster dfsCluster = null;
+    dfsCluster = new MiniDFSCluster.Builder(dfsConf).numDataNodes(numDataNodes).build();
+    dfsCluster.waitActive();
+
+    DistributedFileSystem fs = dfsCluster.getFileSystem();
+    Path fileName = new Path("/testfile.dat");
+    FSDataOutputStream fos = fs.create(fileName);
+    DataChecksum crc32c = DataChecksum.newDataChecksum(DataChecksum.Type.CRC32C, 512);
+
+    long loop = 0;
+    Random r = new Random();
+    byte[] buf = new byte[1 * 1024 * 1024];

Review Comment:
   `byte[] buf = new byte[(int) blockSize];`





> First packet data + checksum size will be set to 516 bytes when writing to a 
> new block.
> ---
>
> Key: HDFS-17293
> URL: https://issues.apache.org/jira/browse/HDFS-17293
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Affects Versions: 3.3.6
>Reporter: farmmamba
>Assignee: farmmamba
>Priority: Major
>  Labels: pull-request-available
>
> First packet size will be set to 516 bytes when writing to a new block.
> In the method computePacketChunkSize, the parameters psize and csize would 
> be (0, 512) when writing to a new block. It would be better to use 
> writePacketSize.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-17332) DFSInputStream: avoid logging stacktrace until when we really need to fail a read request with a MissingBlockException

2024-01-18 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17808427#comment-17808427
 ] 

ASF GitHub Bot commented on HDFS-17332:
---

ctrezzo merged PR #6446:
URL: https://github.com/apache/hadoop/pull/6446




> DFSInputStream: avoid logging stacktrace until when we really need to fail a 
> read request with a MissingBlockException
> --
>
> Key: HDFS-17332
> URL: https://issues.apache.org/jira/browse/HDFS-17332
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs
>Reporter: Xing Lin
>Assignee: Xing Lin
>Priority: Minor
>  Labels: pull-request-available
>
> In DFSInputStream#actualGetFromOneDataNode(), we send the exception 
> stacktrace to dfsClient.LOG whenever we fail on a DN. However, in most 
> cases the read request will be served successfully by reading from the next 
> available DN. The presence of the exception stacktrace in the log has caused 
> multiple Hadoop users at LinkedIn to consider this WARN message the 
> root-cause/fatal error for their jobs. We would like to improve the log 
> message and avoid sending the stacktrace to dfsClient.LOG when a read 
> succeeds. The stacktrace for reads from each DN is sent to the log only when 
> we really need to fail a read request (when 
> chooseDataNode()/refetchLocations() throws a BlockMissingException).
>  
> Example stack trace
> {code:java}
> [12]:23/11/30 23:01:33 WARN hdfs.DFSClient: Connection failure: 
> Failed to connect to 10.150.91.13/10.150.91.13:71 for file 
> //part--95b9909c-zzz-c000.avro for block 
> BP-364971551-DatanodeIP-1448516588954:blk__129864739321:java.net.SocketTimeoutException:
>  6 millis timeout while waiting for channel to be ready for read. ch : 
> java.nio.channels.SocketChannel[connected local=/ip:40492 
> remote=datanodeIP:71] [12]:java.net.SocketTimeoutException: 6 
> millis timeout while waiting for channel to be ready for read. ch : 
> java.nio.channels.SocketChannel[connected local=/localIp:40492 
> remote=datanodeIP:71] [12]: at 
> org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164) 
> [12]: at 
> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161) 
> [12]: at 
> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131) 
> [12]: at 
> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:118) 
> [12]: at java.io.FilterInputStream.read(FilterInputStream.java:83) 
> [12]: at 
> org.apache.hadoop.hdfs.protocolPB.PBHelperClient.vintPrefixed(PBHelperClient.java:458)
>  [12]: at 
> org.apache.hadoop.hdfs.client.impl.BlockReaderRemote2.newBlockReader(BlockReaderRemote2.java:412)
>  [12]: at 
> org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.getRemoteBlockReader(BlockReaderFactory.java:864)
>  [12]: at 
> org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:753)
>  [12]: at 
> org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.build(BlockReaderFactory.java:387)
>  [12]: at 
> org.apache.hadoop.hdfs.DFSInputStream.getBlockReader(DFSInputStream.java:736) 
> [12]: at 
> org.apache.hadoop.hdfs.DFSInputStream.actualGetFromOneDataNode(DFSInputStream.java:1268)
>  [12]: at 
> org.apache.hadoop.hdfs.DFSInputStream.fetchBlockByteRange(DFSInputStream.java:1216)
>  [12]: at 
> org.apache.hadoop.hdfs.DFSInputStream.pread(DFSInputStream.java:1608) 
> [12]: at 
> org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:1568) 
> [12]: at 
> org.apache.hadoop.fs.FSDataInputStream.read(FSDataInputStream.java:93) 
> [12]: at 
> hdfs_metrics_shade.org.apache.hadoop.fs.InstrumentedFSDataInputStream$InstrumentedFilterInputStream.lambda$read$0(InstrumentedFSDataInputStream.java:108)
>  [12]: at 
> com.linkedin.hadoop.metrics.fs.PerformanceTrackingFSDataInputStream.process(PerformanceTrackingFSDataInputStream.java:39)
>  [12]: at 
> hdfs_metrics_shade.org.apache.hadoop.fs.InstrumentedFSDataInputStream$InstrumentedFilterInputStream.read(InstrumentedFSDataInputStream.java:108)
>  [12]: at 
> org.apache.hadoop.fs.FSDataInputStream.read(FSDataInputStream.java:93) 
> [12]: at 
> org.apache.hadoop.fs.RetryingInputStream.lambda$read$2(RetryingInputStream.java:153)
>  [12]: at 
> org.apache.hadoop.fs.NoOpRetryPolicy.run(NoOpRetryPolicy.java:36) 
> [12]: at 
> org.apache.hadoop.fs.RetryingInputStream.read(RetryingInputStream.java:149) 
> [12]: at 
> org.apache.hadoop.fs.FSDataInputStream.read(FSDataInputStream.java:93){code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org

[jira] [Updated] (HDFS-17341) Support dedicated user queues in Namenode FairCallQueue

2024-01-18 Thread Lei Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lei Yang updated HDFS-17341:

Description: 
Some service users in the namenode today, such as ETL, metrics collection, and 
ad-hoc users that run business-critical jobs, account for much of the namenode 
traffic and shouldn't be throttled the same way as other individual users in 
FCQ.

There is a [feature|https://issues.apache.org/jira/browse/HADOOP-17165] in the 
namenode to always prioritize some service users so they are not subject to 
FCQ scheduling (those users are always p0), but it is not perfect and it 
doesn't account for traffic surges from those users.

The idea is to allocate dedicated rpc queues with bounded capacity for those 
service users and assign each queue a processing weight. If a queue is full, 
those users are expected to back off and retry.

 

New configs:
{code:java}
"faircallqueue.reserved.users"; // list of service users that are assigned to 
dedicated queue
"faircallqueue.reserved.users.max"; // max number of service users allowed
"faircallqueue.reserved.users.capacities"; // custom queue capacities for each 
service user
"faircallqueue.multiplexer.reserved.weights"; // processing weights for each 
dedicated queue{code}
For instance, for a FCQ with 4 priority levels, 2 reserved users(a, b)

FCQ would look like:

 
{code:java}
P0: shared queue
P1: shared queue
P2: shared queue
P3: shared queue
P4: dedicated for user a
P5: dedicated for user b{code}
{color:#172b4d}The Multiplexer would have the following weights{color}

{color:#172b4d}shared queue default weights: [8, 4, 2, 1]{color}

{color:#172b4d}reserved queue weights=[3, 2]{color}

{color:#172b4d}The weights sum to 20, so user a gets 3/20 = 15% of total 
cycles and user b gets 2/20 = 10% of total cycles.{color}
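For clarity, the shares fall directly out of the combined weight vector (a 
minimal sketch of the arithmetic, not FairCallQueue code):
{code:java}
int[] sharedWeights   = {8, 4, 2, 1}; // existing FCQ priority levels
int[] reservedWeights = {3, 2};       // dedicated queues for users a and b

int total = 0;
for (int w : sharedWeights)   total += w;
for (int w : reservedWeights) total += w; // total = 20

for (int i = 0; i < reservedWeights.length; i++) {
    // 3/20 = 15% for user a, 2/20 = 10% for user b
    System.out.printf("reserved queue %d: %.0f%% of cycles%n",
        i, 100.0 * reservedWeights[i] / total);
}
{code}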

 

 

  was:
Some service users in the namenode today, such as ETL, metrics collection, and 
ad-hoc users that run business-critical jobs, account for much of the namenode 
traffic and shouldn't be throttled the same way as other individual users in 
FCQ.

There is a feature in the namenode to always prioritize some service users so 
they are not subject to FCQ scheduling (those users are always p0), but it is 
not perfect and it doesn't account for traffic surges from those users.

The idea is to allocate dedicated rpc queues with bounded capacity for those 
service users and assign each queue a processing weight. If a queue is full, 
those users are expected to back off and retry.

 

New configs:
{code:java}
"faircallqueue.reserved.users"; // list of service users that are assigned to 
dedicated queue
"faircallqueue.reserved.users.max"; // max number of service users allowed
"faircallqueue.reserved.users.capacities"; // custom queue capacities for each 
service user
"faircallqueue.multiplexer.reserved.weights"; // processing weights for each 
dedicated queue{code}
For instance, for a FCQ with 4 priority levels, 2 reserved users(a, b)

FCQ would look like:

 
{code:java}
P0: shared queue
P1: shared queue
P2: shared queue
P3: shared queue
P4: dedicated for user a
P5: dedicated for user b{code}
{color:#172b4d}The Multiplexer would have the following weights{color}

{color:#172b4d}shared queue default weights: [8, 4, 2, 1]{color}

{color:#172b4d}reserved queue weights=[3, 2]{color}

{color:#172b4d}The weights sum to 20, so user a gets 3/20 = 15% of total 
cycles and user b gets 2/20 = 10% of total cycles.{color}

 

 


> Support dedicated user queues in Namenode FairCallQueue
> ---
>
> Key: HDFS-17341
> URL: https://issues.apache.org/jira/browse/HDFS-17341
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Affects Versions: 2.10.0, 3.4.0
>Reporter: Lei Yang
>Priority: Major
>  Labels: pull-request-available
>
> Some service users in the namenode today, such as ETL, metrics collection, 
> and ad-hoc users that run business-critical jobs, account for much of the 
> namenode traffic and shouldn't be throttled the same way as other individual 
> users in FCQ.
> There is a [feature|https://issues.apache.org/jira/browse/HADOOP-17165] in 
> the namenode to always prioritize some service users so they are not subject 
> to FCQ scheduling (those users are always p0), but it is not perfect and it 
> doesn't account for traffic surges from those users.
> The idea is to allocate dedicated rpc queues with bounded capacity for those 
> service users and assign each queue a processing weight. If a queue is full, 
> those users are expected to back off and retry.
>  
> New configs:
> {code:java}
> "faircallqueue.reserved.users"; // list of service users that are assigned to 
> dedicated queue
> "faircallqueue.reserved.users.max"; // max number of service users allowed
> "faircallqueue.reserved.users.capacities"; // custom queue capacities for 
> each service user
> "faircallqueue.multiplexer.reserved.weights"; 

[jira] [Commented] (HDFS-17343) Revert HDFS-16016. BPServiceActor to provide new thread to handle IBR

2024-01-18 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17808400#comment-17808400
 ] 

ASF GitHub Bot commented on HDFS-17343:
---

slfan1989 commented on PR #6457:
URL: https://github.com/apache/hadoop/pull/6457#issuecomment-1899387396

   @ayushtkn Can you help review this PR? Thank you very much!




> Revert HDFS-16016. BPServiceActor to provide new thread to handle IBR
> -
>
> Key: HDFS-17343
> URL: https://issues.apache.org/jira/browse/HDFS-17343
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Affects Versions: 3.4.0
>Reporter: Shilun Fan
>Assignee: Shilun Fan
>Priority: Major
>  Labels: pull-request-available
>
> When preparing for the hadoop-3.4.0 release, we found that HDFS-16016 may 
> cause incremental block reports (IBRs) and full block reports (FBRs) to be 
> delivered out of order on the datanode. After discussion, we decided to 
> revert HDFS-16016.
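> A hedged illustration of the hazard (made-up names, not the HDFS-16016 
> code): once IBRs are sent from their own thread, nothing orders them against 
> the FBR sent from the actor thread:
> {code:java}
> ExecutorService ibrThread = Executors.newSingleThreadExecutor();
>
> void onBlockReceived(Block b) {
>     // IBR queued on a separate thread (the HDFS-16016 approach)
>     ibrThread.submit(() -> nameNode.incrementalBlockReport(b));
> }
>
> void heartbeatLoop() {
>     // FBR sent from the actor thread; an IBR queued earlier may still
>     // arrive at the NameNode after this FBR, i.e. reports get mis-ordered.
>     nameNode.fullBlockReport(snapshotOfAllBlocks());
> }
> {code}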



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-17341) Support dedicated user queues in Namenode FairCallQueue

2024-01-18 Thread Lei Yang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17808392#comment-17808392
 ] 

Lei Yang commented on HDFS-17341:
-

[~hexiaoqiao] Thanks for your comment. 
{quote}One concerns, we evaluate the request if high- or low- priority based on 
user only, but not all requests from this user are always high or low priority 
in fact.
{quote}
Not sure I understand this. The idea is to exempt some critical service users 
from the existing FCQ mechanism to make sure they are not throttled in the 
same way as regular users in the shared queue. Meanwhile, those users should 
not flood the entire queue if there is a traffic surge 
(https://issues.apache.org/jira/browse/HADOOP-17165 can assign a service user 
to p0, but it cannot solve traffic surges from those users). We can assign 
weights to those users to ensure they do not exceed a certain % of total 
processing cycles.
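As a hypothetical illustration only (the keys come from the JIRA description; 
the values here are invented for the example), the setup for two reserved 
users could look like:
{code:java}
Configuration conf = new Configuration();
conf.set("faircallqueue.reserved.users", "etl_user,metrics_user");
conf.setInt("faircallqueue.reserved.users.max", 4);
conf.set("faircallqueue.reserved.users.capacities", "1000,500");
conf.set("faircallqueue.multiplexer.reserved.weights", "3,2");
{code}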

 

> Support dedicated user queues in Namenode FairCallQueue
> ---
>
> Key: HDFS-17341
> URL: https://issues.apache.org/jira/browse/HDFS-17341
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Affects Versions: 2.10.0, 3.4.0
>Reporter: Lei Yang
>Priority: Major
>  Labels: pull-request-available
>
> Some service users in the namenode today, such as ETL, metrics collection, 
> and ad-hoc users that run business-critical jobs, account for much of the 
> namenode traffic and shouldn't be throttled the same way as other individual 
> users in FCQ.
> There is a feature in the namenode to always prioritize some service users 
> so they are not subject to FCQ scheduling (those users are always p0), but 
> it is not perfect and it doesn't account for traffic surges from those users.
> The idea is to allocate dedicated rpc queues with bounded capacity for those 
> service users and assign each queue a processing weight. If a queue is full, 
> those users are expected to back off and retry.
>  
> New configs:
> {code:java}
> "faircallqueue.reserved.users"; // list of service users that are assigned to 
> dedicated queue
> "faircallqueue.reserved.users.max"; // max number of service users allowed
> "faircallqueue.reserved.users.capacities"; // custom queue capacities for 
> each service user
> "faircallqueue.multiplexer.reserved.weights"; // processing weights for each 
> dedicated queue{code}
> For instance, for a FCQ with 4 priority levels, 2 reserved users(a, b)
> FCQ would look like:
>  
> {code:java}
> P0: shared queue
> P1: shared queue
> P2: shared queue
> P3: shared queue
> P4: dedicated for user a
> P5: dedicated for user b{code}
> {color:#172b4d}The Multiplexer would have the following weights{color}
> {color:#172b4d}shared queue default weights: [8, 4, 2, 1]{color}
> {color:#172b4d}reserved queue weights=[3, 2]{color}
> {color:#172b4d}The weights sum to 20, so user a gets 3/20 = 15% of total 
> cycles and user b gets 2/20 = 10% of total cycles.{color}
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-17341) Support dedicated user queues in Namenode FairCallQueue

2024-01-18 Thread Lei Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lei Yang updated HDFS-17341:

Description: 
Some service users in the namenode today, such as ETL, metrics collection, and 
ad-hoc users that run business-critical jobs, account for much of the namenode 
traffic and shouldn't be throttled the same way as other individual users in 
FCQ.

There is a feature in the namenode to always prioritize some service users so 
they are not subject to FCQ scheduling (those users are always p0), but it is 
not perfect and it doesn't account for traffic surges from those users.

The idea is to allocate dedicated rpc queues with bounded capacity for those 
service users and assign each queue a processing weight. If a queue is full, 
those users are expected to back off and retry.

 

New configs:
{code:java}
"faircallqueue.reserved.users"; // list of service users that are assigned to 
dedicated queue
"faircallqueue.reserved.users.max"; // max number of service users allowed
"faircallqueue.reserved.users.capacities"; // custom queue capacities for each 
service user
"faircallqueue.multiplexer.reserved.weights"; // processing weights for each 
dedicated queue{code}
For instance, for a FCQ with 4 priority levels, 2 reserved users(a, b)

FCQ would look like:

 
{code:java}
P0: shared queue
P1: shared queue
P2: shared queue
P3: shared queue
P4: dedicated for user a
P5: dedicated for user b{code}
{color:#172b4d}The Multiplexer would have the following weights{color}

{color:#172b4d}shared queue default weights: [8, 4, 2, 1]{color}

{color:#172b4d}reserved queue weights=[3, 2]{color}

{color:#172b4d}The weights sum to 20, so user a gets 3/20 = 15% of total 
cycles and user b gets 2/20 = 10% of total cycles.{color}

 

 

  was:
Some service users in the namenode today, such as ETL, metrics collection, and 
ad-hoc users that run business-critical jobs, account for much of the namenode 
traffic and shouldn't be throttled the same way as other individual users in 
FCQ.

There is a feature in the namenode to always prioritize some service users so 
they are not subject to FCQ scheduling (those users are always p0), but it is 
not perfect and it doesn't account for traffic surges from those users.

The idea is to allocate dedicated rpc queues with bounded capacity for those 
service users and assign each queue a processing weight. If a queue is full, 
those users are expected to back off and retry.

 

New configs:
{code:java}
"faircallqueue.reserved.users"; // list of service users that are assigned to 
dedicated queue
"faircallqueue.reserved.users.max"; // max number of service users allowed
"faircallqueue.reserved.users.capacities"; // custom queue capacities for each 
service user
"faircallqueue.multiplexer.reserved.weights"; // processing weights for each 
dedicated queue{code}
For instance, for a FCQ with 4 priority levels, 2 reserved users(a, b)

FCQ would look like:

 
{code:java}
P0: shared queue
P1: shared queue
P2: shared queue
P3: shared queue
P4: dedicated for user a
P5: dedicated for user b{code}
{color:#172b4d}The WRM would have the following weights{color}

{color:#172b4d}shared queue default weights: [8, 4, 2, 1]{color}

{color:#172b4d}reserved queue weights=[3, 2]{color}

{color:#172b4d}The weights sum to 20, so user a gets 3/20 = 15% of total 
cycles and user b gets 2/20 = 10% of total cycles.{color}

 

 


> Support dedicated user queues in Namenode FairCallQueue
> ---
>
> Key: HDFS-17341
> URL: https://issues.apache.org/jira/browse/HDFS-17341
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Affects Versions: 2.10.0, 3.4.0
>Reporter: Lei Yang
>Priority: Major
>  Labels: pull-request-available
>
> Some service users in the namenode today, such as ETL, metrics collection, 
> and ad-hoc users that run business-critical jobs, account for much of the 
> namenode traffic and shouldn't be throttled the same way as other individual 
> users in FCQ.
> There is a feature in the namenode to always prioritize some service users 
> so they are not subject to FCQ scheduling (those users are always p0), but 
> it is not perfect and it doesn't account for traffic surges from those users.
> The idea is to allocate dedicated rpc queues with bounded capacity for those 
> service users and assign each queue a processing weight. If a queue is full, 
> those users are expected to back off and retry.
>  
> New configs:
> {code:java}
> "faircallqueue.reserved.users"; // list of service users that are assigned to 
> dedicated queue
> "faircallqueue.reserved.users.max"; // max number of service users allowed
> "faircallqueue.reserved.users.capacities"; // custom queue capacities for 
> each service user
> "faircallqueue.multiplexer.reserved.weights"; // processing weights for each 
> dedicated queue{code}
> For instance, for a FCQ with 4 priority levels, 2 reserved 

[jira] [Updated] (HDFS-17341) Support dedicated user queues in Namenode FairCallQueue

2024-01-18 Thread Lei Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lei Yang updated HDFS-17341:

Description: 
Some service users in the namenode today, such as ETL, metrics collection, and 
ad-hoc users that run business-critical jobs, account for much of the namenode 
traffic and shouldn't be throttled the same way as other individual users in 
FCQ.

There is a feature in the namenode to always prioritize some service users so 
they are not subject to FCQ scheduling (those users are always p0), but it is 
not perfect and it doesn't account for traffic surges from those users.

The idea is to allocate dedicated rpc queues with bounded capacity for those 
service users and assign each queue a processing weight. If a queue is full, 
those users are expected to back off and retry.

 

New configs:
{code:java}
"faircallqueue.reserved.users"; // list of service users that are assigned to 
dedicated queue
"faircallqueue.reserved.users.max"; // max number of service users allowed
"faircallqueue.reserved.users.capacities"; // custom queue capacities for each 
service user
"faircallqueue.multiplexer.reserved.weights"; // processing weights for each 
dedicated queue{code}
For instance, for a FCQ with 4 priority levels, 2 reserved users(a, b)

FCQ would look like:

 
{code:java}
P0: shared queue
P1: shared queue
P2: shared queue
P3: shared queue
P4: dedicated for user a
P5: dedicated for user b{code}
{color:#172b4d}The WRM would have the following weights{color}

{color:#172b4d}shared queue default weights: [8, 4, 2, 1]{color}

{color:#172b4d}reserved queue weights=[3, 2]{color}

{color:#172b4d}The weights sum to 20, so user a gets 3/20 = 15% of total 
cycles and user b gets 2/20 = 10% of total cycles.{color}

 

 

  was:
Some service users in the namenode today, such as ETL, metrics collection, and 
ad-hoc users that run business-critical jobs, account for much of the namenode 
traffic and shouldn't be throttled the same way as other individual users in 
FCQ.

There is a feature in the namenode to always prioritize some service users so 
they are not subject to FCQ scheduling (those users are always p0), but it is 
not perfect and it doesn't account for traffic surges from those users.

The idea is to allocate dedicated rpc queues with bounded capacity for those 
service users and assign each queue a processing weight. If a queue is full, 
those users are expected to back off and retry.

 

New configs:
{code:java}
"faircallqueue.reserved.users"; // list of service users that are assigned to 
dedicated queue
"faircallqueue.reserved.users.max"; // max number of service users allowed
"faircallqueue.reserved.users.capacities"; // custom queue capacities for each 
service user
"faircallqueue.multiplexer.reserved.weights"; // processing weights for each 
dedicated queue{code}
 


> Support dedicated user queues in Namenode FairCallQueue
> ---
>
> Key: HDFS-17341
> URL: https://issues.apache.org/jira/browse/HDFS-17341
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Affects Versions: 2.10.0, 3.4.0
>Reporter: Lei Yang
>Priority: Major
>  Labels: pull-request-available
>
> Some service users in the namenode today, such as ETL, metrics collection, 
> and ad-hoc users that run business-critical jobs, account for much of the 
> namenode traffic and shouldn't be throttled the same way as other individual 
> users in FCQ.
> There is a feature in the namenode to always prioritize some service users 
> so they are not subject to FCQ scheduling (those users are always p0), but 
> it is not perfect and it doesn't account for traffic surges from those users.
> The idea is to allocate dedicated rpc queues with bounded capacity for those 
> service users and assign each queue a processing weight. If a queue is full, 
> those users are expected to back off and retry.
>  
> New configs:
> {code:java}
> "faircallqueue.reserved.users"; // list of service users that are assigned to 
> dedicated queue
> "faircallqueue.reserved.users.max"; // max number of service users allowed
> "faircallqueue.reserved.users.capacities"; // custom queue capacities for 
> each service user
> "faircallqueue.multiplexer.reserved.weights"; // processing weights for each 
> dedicated queue{code}
> For instance, for a FCQ with 4 priority levels, 2 reserved users(a, b)
> FCQ would look like:
>  
> {code:java}
> P0: shared queue
> P1: shared queue
> P2: shared queue
> P3: shared queue
> P4: dedicated for user a
> P5: dedicated for user b{code}
> {color:#172b4d}The WRM would have the following weights{color}
> {color:#172b4d}shared queue default weights: [8, 4, 2, 1]{color}
> {color:#172b4d}reserved queue weights=[3, 2]{color}
> {color:#172b4d}The weights sum to 20, so user a gets 3/20 = 15% of total 
> cycles and user b gets 2/20 = 10% of total cycles.{color}
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Commented] (HDFS-17342) Fix DataNode may invalidates normal block causing missing block

2024-01-18 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17808371#comment-17808371
 ] 

ASF GitHub Bot commented on HDFS-17342:
---

hadoop-yetus commented on PR #6464:
URL: https://github.com/apache/hadoop/pull/6464#issuecomment-1899203477

   :broken_heart: **-1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |::|--:|:|::|:---:|
   | +0 :ok: |  reexec  |   0m 23s |  |  Docker mode activated.  |
    _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files 
found.  |
   | +0 :ok: |  codespell  |   0m  0s |  |  codespell was not available.  |
   | +0 :ok: |  detsecrets  |   0m  0s |  |  detect-secrets was not available.  
|
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain 
any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to 
include 1 new or modified test files.  |
    _ trunk Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |  32m  2s |  |  trunk passed  |
   | +1 :green_heart: |  compile  |   0m 41s |  |  trunk passed with JDK 
Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04  |
   | +1 :green_heart: |  compile  |   0m 38s |  |  trunk passed with JDK 
Private Build-1.8.0_392-8u392-ga-1~20.04-b08  |
   | +1 :green_heart: |  checkstyle  |   0m 38s |  |  trunk passed  |
   | +1 :green_heart: |  mvnsite  |   0m 43s |  |  trunk passed  |
   | +1 :green_heart: |  javadoc  |   0m 39s |  |  trunk passed with JDK 
Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04  |
   | +1 :green_heart: |  javadoc  |   1m  0s |  |  trunk passed with JDK 
Private Build-1.8.0_392-8u392-ga-1~20.04-b08  |
   | +1 :green_heart: |  spotbugs  |   1m 42s |  |  trunk passed  |
   | +1 :green_heart: |  shadedclient  |  20m 29s |  |  branch has no errors 
when building and testing our client artifacts.  |
    _ Patch Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |   0m 39s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   0m 45s |  |  the patch passed with JDK 
Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04  |
   | +1 :green_heart: |  javac  |   0m 45s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   0m 32s |  |  the patch passed with JDK 
Private Build-1.8.0_392-8u392-ga-1~20.04-b08  |
   | +1 :green_heart: |  javac  |   0m 32s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks 
issues.  |
   | -0 :warning: |  checkstyle  |   0m 30s | 
[/results-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6464/2/artifact/out/results-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt)
 |  hadoop-hdfs-project/hadoop-hdfs: The patch generated 1 new + 87 unchanged - 
0 fixed = 88 total (was 87)  |
   | +1 :green_heart: |  mvnsite  |   0m 37s |  |  the patch passed  |
   | +1 :green_heart: |  javadoc  |   0m 29s |  |  the patch passed with JDK 
Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04  |
   | +1 :green_heart: |  javadoc  |   1m  2s |  |  the patch passed with JDK 
Private Build-1.8.0_392-8u392-ga-1~20.04-b08  |
   | +1 :green_heart: |  spotbugs  |   1m 40s |  |  the patch passed  |
   | +1 :green_heart: |  shadedclient  |  21m  7s |  |  patch has no errors 
when building and testing our client artifacts.  |
    _ Other Tests _ |
   | -1 :x: |  unit  | 601m 20s | 
[/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6464/2/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt)
 |  hadoop-hdfs in the patch passed.  |
   | +1 :green_heart: |  asflicense  |   0m 27s |  |  The patch does not 
generate ASF License warnings.  |
   |  |   | 689m 46s |  |  |
   
   
   | Reason | Tests |
   |---:|:--|
   | Failed junit tests | 
hadoop.hdfs.TestReconstructStripedFileWithRandomECPolicy |
   |   | hadoop.hdfs.TestDFSStripedInputStream |
   |   | hadoop.hdfs.server.datanode.fsdataset.impl.TestFsDatasetImpl |
   |   | hadoop.hdfs.TestParallelShortCircuitReadNoChecksum |
   
   
   | Subsystem | Report/Notes |
   |--:|:-|
   | Docker | ClientAPI=1.43 ServerAPI=1.43 base: 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6464/2/artifact/out/Dockerfile
 |
   | GITHUB PR | https://github.com/apache/hadoop/pull/6464 |
   | Optional Tests | dupname asflicense compile javac javadoc mvninstall 
mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
   | uname | Linux 30043234e0f6 5.15.0-88-generic #98-Ubuntu SMP Mon Oct 2 
15:18:56 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/bin/hadoop.sh |
   | git revision | trunk / 83eab24c7696017a24412340514a6977b6a394af |
   | Default Java | Private Build-1.8.0_392-8u392-ga-1~20.04-b08 |
   | Multi-JDK versions | 

[jira] [Commented] (HDFS-17339) BPServiceActor should skip cacheReport when one blockPool does not have CacheBlock on this DataNode

2024-01-18 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17808361#comment-17808361
 ] 

ASF GitHub Bot commented on HDFS-17339:
---

hadoop-yetus commented on PR #6456:
URL: https://github.com/apache/hadoop/pull/6456#issuecomment-1899150670

   :broken_heart: **-1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |::|--:|:|::|:---:|
   | +0 :ok: |  reexec  |   0m 34s |  |  Docker mode activated.  |
    _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files 
found.  |
   | +0 :ok: |  codespell  |   0m  0s |  |  codespell was not available.  |
   | +0 :ok: |  detsecrets  |   0m  0s |  |  detect-secrets was not available.  
|
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain 
any @author tags.  |
   | -1 :x: |  test4tests  |   0m  0s |  |  The patch doesn't appear to include 
any new or modified tests. Please justify why no new tests are needed for this 
patch. Also please list what manual steps were performed to verify this patch.  
|
    _ trunk Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |  43m  9s |  |  trunk passed  |
   | +1 :green_heart: |  compile  |   1m 19s |  |  trunk passed with JDK 
Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04  |
   | +1 :green_heart: |  compile  |   1m 10s |  |  trunk passed with JDK 
Private Build-1.8.0_392-8u392-ga-1~20.04-b08  |
   | +1 :green_heart: |  checkstyle  |   1m  9s |  |  trunk passed  |
   | +1 :green_heart: |  mvnsite  |   1m 22s |  |  trunk passed  |
   | +1 :green_heart: |  javadoc  |   1m  5s |  |  trunk passed with JDK 
Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04  |
   | +1 :green_heart: |  javadoc  |   1m 39s |  |  trunk passed with JDK 
Private Build-1.8.0_392-8u392-ga-1~20.04-b08  |
   | +1 :green_heart: |  spotbugs  |   3m 14s |  |  trunk passed  |
   | +1 :green_heart: |  shadedclient  |  34m 23s |  |  branch has no errors 
when building and testing our client artifacts.  |
    _ Patch Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |   1m  9s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   1m 11s |  |  the patch passed with JDK 
Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04  |
   | +1 :green_heart: |  javac  |   1m 11s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   1m  4s |  |  the patch passed with JDK 
Private Build-1.8.0_392-8u392-ga-1~20.04-b08  |
   | +1 :green_heart: |  javac  |   1m  4s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks 
issues.  |
   | +1 :green_heart: |  checkstyle  |   0m 55s |  |  the patch passed  |
   | +1 :green_heart: |  mvnsite  |   1m 11s |  |  the patch passed  |
   | +1 :green_heart: |  javadoc  |   0m 52s |  |  the patch passed with JDK 
Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04  |
   | +1 :green_heart: |  javadoc  |   1m 30s |  |  the patch passed with JDK 
Private Build-1.8.0_392-8u392-ga-1~20.04-b08  |
   | +1 :green_heart: |  spotbugs  |   3m 18s |  |  the patch passed  |
   | +1 :green_heart: |  shadedclient  |  34m 32s |  |  patch has no errors 
when building and testing our client artifacts.  |
    _ Other Tests _ |
   | -1 :x: |  unit  | 225m 18s | 
[/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6456/2/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt)
 |  hadoop-hdfs in the patch passed.  |
   | +1 :green_heart: |  asflicense  |   0m 41s |  |  The patch does not 
generate ASF License warnings.  |
   |  |   | 360m 57s |  |  |
   
   
   | Reason | Tests |
   |---:|:--|
   | Failed junit tests | hadoop.hdfs.server.datanode.TestDirectoryScanner |
   
   
   | Subsystem | Report/Notes |
   |--:|:-|
   | Docker | ClientAPI=1.43 ServerAPI=1.43 base: 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6456/2/artifact/out/Dockerfile
 |
   | GITHUB PR | https://github.com/apache/hadoop/pull/6456 |
   | JIRA Issue | HDFS-17339 |
   | Optional Tests | dupname asflicense compile javac javadoc mvninstall 
mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
   | uname | Linux efee4dd1d356 5.15.0-91-generic #101-Ubuntu SMP Tue Nov 14 
13:30:08 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/bin/hadoop.sh |
   | git revision | trunk / 243af8bea73098685bbca84a7c22e9e98fedcd57 |
   | Default Java | Private Build-1.8.0_392-8u392-ga-1~20.04-b08 |
   | Multi-JDK versions | 
/usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04 
/usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_392-8u392-ga-1~20.04-b08 |
   |  Test Results | 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6456/2/testReport/ |
   | Max. process+thread count | 4168 (vs. ulimit of 5500) |
   | modules | 

[jira] [Commented] (HDFS-17302) RBF: ProportionRouterRpcFairnessPolicyController-Sharing and isolation.

2024-01-18 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17808352#comment-17808352
 ] 

ASF GitHub Bot commented on HDFS-17302:
---

goiri commented on PR #6380:
URL: https://github.com/apache/hadoop/pull/6380#issuecomment-1899080274

   @huangzhaobo99 do you still have concerns with the approach?




> RBF: ProportionRouterRpcFairnessPolicyController-Sharing and isolation.
> ---
>
> Key: HDFS-17302
> URL: https://issues.apache.org/jira/browse/HDFS-17302
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: rbf
>Reporter: Jian Zhang
>Assignee: Jian Zhang
>Priority: Major
>  Labels: pull-request-available
> Attachments: HDFS-17302.001.patch, HDFS-17302.002.patch, 
> HDFS-17302.003.patch
>
>
> h2. Current shortcomings
> [HDFS-14090|https://issues.apache.org/jira/browse/HDFS-14090] provides a 
> StaticRouterRpcFairnessPolicyController to support configuring different 
> handlers for different ns. Using the StaticRouterRpcFairnessPolicyController 
> allows the router to isolate different ns, and the ns with a higher load will 
> not affect the router's access to the ns with a normal load. But the 
> StaticRouterRpcFairnessPolicyController still falls short in many ways, such 
> as:
> 1. *Configuration is inconvenient and error-prone*: When I use 
> StaticRouterRpcFairnessPolicyController, I first need to know how many 
> handlers the router has in total and how many nameservices it currently 
> serves, and then carefully calculate how many handlers to allocate to each 
> ns so that the sum over all ns does not exceed the router's total, while 
> also considering how many handlers each ns needs for good performance. 
> Therefore, I need to be very careful when configuring. If the configured 
> total exceeds the router's handler count by even one, the router fails to 
> start; I then have to investigate the reason for the failure and reconsider 
> the allocation for each ns. Likewise, whenever I change the router's total 
> handler count, I have to re-allocate handlers to every ns, which 
> undoubtedly increases the complexity of operation and maintenance.
> 2. *Adding nameservices is not supported*: While the router is running, if 
> a new ns is added to the cluster and a mount is added for it, the ns cannot 
> be accessed through the router because no handlers are allocated to it. We 
> must reconfigure the handler counts and refresh the configuration before 
> the router can access the ns normally, and reconfiguring brings us back to 
> disadvantage 1: configuration is inconvenient and error-prone.
> 3. *Wasted handlers*: The main purpose of proposing 
> RouterRpcFairnessPolicyController is to let the router access normally 
> loaded ns without being affected by highly loaded ns. First, not all ns 
> have high loads; second, ns with high loads are not highly loaded 24 hours 
> a day. Often only certain time periods, such as 0:00 to 8:00, have high 
> loads while other periods are normal. Assume there are 2 ns and each is 
> allocated half of the handlers, ns1 has many requests from 0:00 to 14:00 
> and almost none from 14:00 to 24:00, and ns2 has many requests from 12:00 
> to 24:00 and almost none from 0:00 to 14:00. Then between 0:00 and 12:00 
> and between 14:00 and 24:00, only one ns has significant traffic while the 
> other is nearly idle, so half of the handlers are wasted.
> 4. *Only isolation, no sharing*: The StaticRouterRpcFairnessPolicyController 
> supports only isolation, not sharing. I think isolation is just a means to 
> improve the performance of router access to normally loaded ns, not the 
> goal. It is impossible for all ns in the cluster to have high loads; on the 
> contrary, in most scenarios only a few ns have high loads and the loads of 
> most other ns are normal. Between highly loaded ns and normally loaded ns 
> we need to isolate handlers so that the former do not hurt the performance 
> of the latter. However, nameservices that are all under normal load, or 
> all under high load, do not need to be isolated from each other; ns of the 
> same nature can share the router's handlers, which performs better than 
> assigning a fixed number of handlers to each ns, because each ns can use 
> all the handlers of 

[jira] [Commented] (HDFS-17343) Revert HDFS-16016. BPServiceActor to provide new thread to handle IBR

2024-01-18 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17808289#comment-17808289
 ] 

ASF GitHub Bot commented on HDFS-17343:
---

virajjasani commented on PR #6457:
URL: https://github.com/apache/hadoop/pull/6457#issuecomment-1898800060

   Thanks for working on the revert to unblock 3.4.0!




> Revert HDFS-16016. BPServiceActor to provide new thread to handle IBR
> -
>
> Key: HDFS-17343
> URL: https://issues.apache.org/jira/browse/HDFS-17343
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Affects Versions: 3.4.0
>Reporter: Shilun Fan
>Assignee: Shilun Fan
>Priority: Major
>  Labels: pull-request-available
>
> When preparing for the hadoop-3.4.0 release, we found that HDFS-16016 may 
> cause incremental block reports (IBRs) and full block reports (FBRs) to be 
> delivered out of order on the datanode. After discussion, we decided to 
> revert HDFS-16016.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-17342) Fix DataNode may invalidates normal block causing missing block

2024-01-18 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17808272#comment-17808272
 ] 

ASF GitHub Bot commented on HDFS-17342:
---

hadoop-yetus commented on PR #6464:
URL: https://github.com/apache/hadoop/pull/6464#issuecomment-1898730447

   :broken_heart: **-1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |::|--:|:|::|:---:|
   | +0 :ok: |  reexec  |   0m 22s |  |  Docker mode activated.  |
    _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files 
found.  |
   | +0 :ok: |  codespell  |   0m  0s |  |  codespell was not available.  |
   | +0 :ok: |  detsecrets  |   0m  0s |  |  detect-secrets was not available.  
|
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain 
any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to 
include 1 new or modified test files.  |
    _ trunk Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |  34m 14s |  |  trunk passed  |
   | +1 :green_heart: |  compile  |   0m 44s |  |  trunk passed with JDK 
Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04  |
   | +1 :green_heart: |  compile  |   0m 41s |  |  trunk passed with JDK 
Private Build-1.8.0_392-8u392-ga-1~20.04-b08  |
   | +1 :green_heart: |  checkstyle  |   0m 40s |  |  trunk passed  |
   | +1 :green_heart: |  mvnsite  |   0m 46s |  |  trunk passed  |
   | +1 :green_heart: |  javadoc  |   0m 43s |  |  trunk passed with JDK 
Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04  |
   | +1 :green_heart: |  javadoc  |   1m  5s |  |  trunk passed with JDK 
Private Build-1.8.0_392-8u392-ga-1~20.04-b08  |
   | +1 :green_heart: |  spotbugs  |   1m 58s |  |  trunk passed  |
   | +1 :green_heart: |  shadedclient  |  22m 48s |  |  branch has no errors 
when building and testing our client artifacts.  |
    _ Patch Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |   0m 41s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   0m 39s |  |  the patch passed with JDK 
Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04  |
   | +1 :green_heart: |  javac  |   0m 39s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   0m 37s |  |  the patch passed with JDK 
Private Build-1.8.0_392-8u392-ga-1~20.04-b08  |
   | +1 :green_heart: |  javac  |   0m 37s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks 
issues.  |
   | -0 :warning: |  checkstyle  |   0m 32s | 
[/results-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6464/3/artifact/out/results-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt)
 |  hadoop-hdfs-project/hadoop-hdfs: The patch generated 1 new + 87 unchanged - 
0 fixed = 88 total (was 87)  |
   | +1 :green_heart: |  mvnsite  |   0m 41s |  |  the patch passed  |
   | +1 :green_heart: |  javadoc  |   0m 31s |  |  the patch passed with JDK 
Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04  |
   | +1 :green_heart: |  javadoc  |   1m  3s |  |  the patch passed with JDK 
Private Build-1.8.0_392-8u392-ga-1~20.04-b08  |
   | +1 :green_heart: |  spotbugs  |   1m 53s |  |  the patch passed  |
   | +1 :green_heart: |  shadedclient  |  23m 19s |  |  patch has no errors 
when building and testing our client artifacts.  |
    _ Other Tests _ |
   | -1 :x: |  unit  | 211m 24s | 
[/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6464/3/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt)
 |  hadoop-hdfs in the patch passed.  |
   | +1 :green_heart: |  asflicense  |   0m 21s |  |  The patch does not 
generate ASF License warnings.  |
   |  |   | 306m 44s |  |  |
   
   
   | Reason | Tests |
   |---:|:--|
   | Failed junit tests | hadoop.hdfs.TestDFSStripedInputStream |
   |   | hadoop.hdfs.TestEncryptionZonesWithKMS |
   |   | hadoop.hdfs.TestReconstructStripedFile |
   |   | hadoop.hdfs.server.namenode.snapshot.TestRenameWithSnapshots |
   |   | hadoop.hdfs.server.datanode.TestDirectoryScanner |
   |   | hadoop.hdfs.server.namenode.TestReconstructStripedBlocks |
   |   | hadoop.hdfs.TestDFSStripedOutputStreamWithRandomECPolicy |
   
   
   | Subsystem | Report/Notes |
   |--:|:-|
   | Docker | ClientAPI=1.43 ServerAPI=1.43 base: 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6464/3/artifact/out/Dockerfile
 |
   | GITHUB PR | https://github.com/apache/hadoop/pull/6464 |
   | Optional Tests | dupname asflicense compile javac javadoc mvninstall 
mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
   | uname | Linux 0ea7c919d276 5.15.0-88-generic #98-Ubuntu SMP Mon Oct 2 
15:18:56 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/bin/hadoop.sh |
   | git revision | trunk / 

[jira] [Resolved] (HDFS-17331) Fix Blocks are always -1 and DataNode's version is always UNKNOWN in federationhealth.html

2024-01-18 Thread Shuyan Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shuyan Zhang resolved HDFS-17331.
-
   Fix Version/s: 3.5.0
Hadoop Flags: Reviewed
Target Version/s: 3.5.0
Assignee: lei w
  Resolution: Fixed

> Fix Blocks are always -1 and DataNode's version is always UNKNOWN in 
> federationhealth.html
> ---
>
> Key: HDFS-17331
> URL: https://issues.apache.org/jira/browse/HDFS-17331
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: lei w
>Assignee: lei w
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.5.0
>
> Attachments: After fix.png, Before fix.png
>
>
> Blocks are always -1 and DataNode's version is always UNKNOWN in 
> federationhealth.html



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-17331) Fix Blocks are always -1 and DataNode's version is always UNKNOWN in federationhealth.html

2024-01-18 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17808194#comment-17808194
 ] 

ASF GitHub Bot commented on HDFS-17331:
---

zhangshuyan0 merged PR #6429:
URL: https://github.com/apache/hadoop/pull/6429




> Fix Blocks are always -1 and DataNode's version is always UNKNOWN in 
> federationhealth.html
> ---
>
> Key: HDFS-17331
> URL: https://issues.apache.org/jira/browse/HDFS-17331
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: lei w
>Priority: Major
>  Labels: pull-request-available
> Attachments: After fix.png, Before fix.png
>
>
> Blocks are always -1 and DataNode's version is always UNKNOWN in 
> federationhealth.html



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-17331) Fix Blocks are always -1 and DataNode's version is always UNKNOWN in federationhealth.html

2024-01-18 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17808185#comment-17808185
 ] 

ASF GitHub Bot commented on HDFS-17331:
---

hadoop-yetus commented on PR #6429:
URL: https://github.com/apache/hadoop/pull/6429#issuecomment-1898411792

   :broken_heart: **-1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |::|--:|:|::|:---:|
   | +0 :ok: |  reexec  |   0m 21s |  |  Docker mode activated.  |
    _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files 
found.  |
   | +0 :ok: |  codespell  |   0m  0s |  |  codespell was not available.  |
   | +0 :ok: |  detsecrets  |   0m  0s |  |  detect-secrets was not available.  
|
   | +0 :ok: |  buf  |   0m  0s |  |  buf was not available.  |
   | +0 :ok: |  buf  |   0m  0s |  |  buf was not available.  |
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain 
any @author tags.  |
   | -1 :x: |  test4tests  |   0m  0s |  |  The patch doesn't appear to include 
any new or modified tests. Please justify why no new tests are needed for this 
patch. Also please list what manual steps were performed to verify this patch.  
|
    _ trunk Compile Tests _ |
   | +0 :ok: |  mvndep  |  13m 44s |  |  Maven dependency ordering for branch  |
   | +1 :green_heart: |  mvninstall  |  19m 17s |  |  trunk passed  |
   | +1 :green_heart: |  compile  |   2m 53s |  |  trunk passed with JDK 
Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04  |
   | +1 :green_heart: |  compile  |   2m 46s |  |  trunk passed with JDK 
Private Build-1.8.0_392-8u392-ga-1~20.04-b08  |
   | +1 :green_heart: |  checkstyle  |   0m 44s |  |  trunk passed  |
   | +1 :green_heart: |  mvnsite  |   0m 58s |  |  trunk passed  |
   | +1 :green_heart: |  javadoc  |   0m 53s |  |  trunk passed with JDK 
Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04  |
   | +1 :green_heart: |  javadoc  |   0m 48s |  |  trunk passed with JDK 
Private Build-1.8.0_392-8u392-ga-1~20.04-b08  |
   | +1 :green_heart: |  spotbugs  |   2m 11s |  |  trunk passed  |
   | +1 :green_heart: |  shadedclient  |  19m 32s |  |  branch has no errors 
when building and testing our client artifacts.  |
    _ Patch Compile Tests _ |
   | +0 :ok: |  mvndep  |   0m 20s |  |  Maven dependency ordering for patch  |
   | +1 :green_heart: |  mvninstall  |   0m 44s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   2m 45s |  |  the patch passed with JDK 
Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04  |
   | +1 :green_heart: |  cc  |   2m 45s |  |  the patch passed  |
   | +1 :green_heart: |  javac  |   2m 45s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   2m 42s |  |  the patch passed with JDK 
Private Build-1.8.0_392-8u392-ga-1~20.04-b08  |
   | +1 :green_heart: |  cc  |   2m 42s |  |  the patch passed  |
   | +1 :green_heart: |  javac  |   2m 42s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks 
issues.  |
   | +1 :green_heart: |  checkstyle  |   0m 37s |  |  the patch passed  |
   | +1 :green_heart: |  mvnsite  |   0m 46s |  |  the patch passed  |
   | +1 :green_heart: |  javadoc  |   0m 36s |  |  the patch passed with JDK 
Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04  |
   | +1 :green_heart: |  javadoc  |   0m 41s |  |  the patch passed with JDK 
Private Build-1.8.0_392-8u392-ga-1~20.04-b08  |
   | +1 :green_heart: |  spotbugs  |   2m 14s |  |  the patch passed  |
   | +1 :green_heart: |  shadedclient  |  19m 34s |  |  patch has no errors 
when building and testing our client artifacts.  |
    _ Other Tests _ |
   | +1 :green_heart: |  unit  |   1m 49s |  |  hadoop-hdfs-client in the patch 
passed.  |
   | +1 :green_heart: |  unit  |  18m 59s |  |  hadoop-hdfs-rbf in the patch 
passed.  |
   | +1 :green_heart: |  asflicense  |   0m 25s |  |  The patch does not 
generate ASF License warnings.  |
   |  |   | 118m 59s |  |  |
   
   
   | Subsystem | Report/Notes |
   |--:|:-|
   | Docker | ClientAPI=1.43 ServerAPI=1.43 base: 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6429/9/artifact/out/Dockerfile
 |
   | GITHUB PR | https://github.com/apache/hadoop/pull/6429 |
   | JIRA Issue | HDFS-17331 |
   | Optional Tests | dupname asflicense compile javac javadoc mvninstall 
mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets cc buflint 
bufcompat |
   | uname | Linux b6670bf98a38 5.15.0-88-generic #98-Ubuntu SMP Mon Oct 2 
15:18:56 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/bin/hadoop.sh |
   | git revision | trunk / 6ee5a462a0abb05345f2bd3fbe71ca2e4bb54569 |
   | Default Java | Private Build-1.8.0_392-8u392-ga-1~20.04-b08 |
   | Multi-JDK versions | 
/usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04 

[jira] [Commented] (HDFS-17332) DFSInputStream: avoid logging stacktrace until we really need to fail a read request with a BlockMissingException

2024-01-18 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17808146#comment-17808146
 ] 

ASF GitHub Bot commented on HDFS-17332:
---

hadoop-yetus commented on PR #6446:
URL: https://github.com/apache/hadoop/pull/6446#issuecomment-1898272516

   :confetti_ball: **+1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |::|--:|:|::|:---:|
   | +0 :ok: |  reexec  |   0m 49s |  |  Docker mode activated.  |
    _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files 
found.  |
   | +0 :ok: |  codespell  |   0m  1s |  |  codespell was not available.  |
   | +0 :ok: |  detsecrets  |   0m  1s |  |  detect-secrets was not available.  
|
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain 
any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to 
include 2 new or modified test files.  |
    _ trunk Compile Tests _ |
   | +0 :ok: |  mvndep  |  13m 49s |  |  Maven dependency ordering for branch  |
   | +1 :green_heart: |  mvninstall  |  35m 28s |  |  trunk passed  |
   | +1 :green_heart: |  compile  |   6m 12s |  |  trunk passed with JDK 
Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04  |
   | +1 :green_heart: |  compile  |   5m 46s |  |  trunk passed with JDK 
Private Build-1.8.0_392-8u392-ga-1~20.04-b08  |
   | +1 :green_heart: |  checkstyle  |   1m 26s |  |  trunk passed  |
   | +1 :green_heart: |  mvnsite  |   2m 19s |  |  trunk passed  |
   | +1 :green_heart: |  javadoc  |   1m 53s |  |  trunk passed with JDK 
Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04  |
   | +1 :green_heart: |  javadoc  |   2m 20s |  |  trunk passed with JDK 
Private Build-1.8.0_392-8u392-ga-1~20.04-b08  |
   | +1 :green_heart: |  spotbugs  |   5m 58s |  |  trunk passed  |
   | +1 :green_heart: |  shadedclient  |  39m 45s |  |  branch has no errors 
when building and testing our client artifacts.  |
    _ Patch Compile Tests _ |
   | +0 :ok: |  mvndep  |   0m 31s |  |  Maven dependency ordering for patch  |
   | +1 :green_heart: |  mvninstall  |   2m  2s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   5m 54s |  |  the patch passed with JDK 
Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04  |
   | +1 :green_heart: |  javac  |   5m 54s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   5m 45s |  |  the patch passed with JDK 
Private Build-1.8.0_392-8u392-ga-1~20.04-b08  |
   | +1 :green_heart: |  javac  |   5m 45s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks 
issues.  |
   | +1 :green_heart: |  checkstyle  |   1m 19s |  |  hadoop-hdfs-project: The 
patch generated 0 new + 43 unchanged - 1 fixed = 43 total (was 44)  |
   | +1 :green_heart: |  mvnsite  |   2m  2s |  |  the patch passed  |
   | +1 :green_heart: |  javadoc  |   1m 32s |  |  the patch passed with JDK 
Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04  |
   | +1 :green_heart: |  javadoc  |   2m  5s |  |  the patch passed with JDK 
Private Build-1.8.0_392-8u392-ga-1~20.04-b08  |
   | +1 :green_heart: |  spotbugs  |   6m  1s |  |  the patch passed  |
   | +1 :green_heart: |  shadedclient  |  39m 45s |  |  patch has no errors 
when building and testing our client artifacts.  |
    _ Other Tests _ |
   | +1 :green_heart: |  unit  |   2m 24s |  |  hadoop-hdfs-client in the patch 
passed.  |
   | +1 :green_heart: |  unit  | 254m 38s |  |  hadoop-hdfs in the patch 
passed.  |
   | +1 :green_heart: |  asflicense  |   0m 45s |  |  The patch does not 
generate ASF License warnings.  |
   |  |   | 441m 21s |  |  |
   
   
   | Subsystem | Report/Notes |
   |--:|:-|
   | Docker | ClientAPI=1.43 ServerAPI=1.43 base: 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6446/10/artifact/out/Dockerfile
 |
   | GITHUB PR | https://github.com/apache/hadoop/pull/6446 |
   | Optional Tests | dupname asflicense compile javac javadoc mvninstall 
mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
   | uname | Linux 4426bef6a3ba 5.15.0-88-generic #98-Ubuntu SMP Mon Oct 2 
15:18:56 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/bin/hadoop.sh |
   | git revision | trunk / 22d136773a102e3f317e6b785cc9c15f41308f4d |
   | Default Java | Private Build-1.8.0_392-8u392-ga-1~20.04-b08 |
   | Multi-JDK versions | 
/usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04 
/usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_392-8u392-ga-1~20.04-b08 |
   |  Test Results | 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6446/10/testReport/ |
   | Max. process+thread count | 2189 (vs. ulimit of 5500) |
   | modules | C: hadoop-hdfs-project/hadoop-hdfs-client 
hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project |
   | Console output | 

[jira] [Commented] (HDFS-17342) Fix DataNode may invalidates normal block causing missing block

2024-01-18 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17808106#comment-17808106
 ] 

ASF GitHub Bot commented on HDFS-17342:
---

hadoop-yetus commented on PR #6464:
URL: https://github.com/apache/hadoop/pull/6464#issuecomment-1898114627

   :broken_heart: **-1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |::|--:|:|::|:---:|
   | +0 :ok: |  reexec  |   0m 25s |  |  Docker mode activated.  |
    _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files 
found.  |
   | +0 :ok: |  codespell  |   0m  0s |  |  codespell was not available.  |
   | +0 :ok: |  detsecrets  |   0m  0s |  |  detect-secrets was not available.  
|
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain 
any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to 
include 1 new or modified test files.  |
    _ trunk Compile Tests _ |
   | -1 :x: |  mvninstall  |   0m 21s | 
[/branch-mvninstall-root.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6464/1/artifact/out/branch-mvninstall-root.txt)
 |  root in trunk failed.  |
   | -1 :x: |  compile  |   0m 21s | 
[/branch-compile-hadoop-hdfs-project_hadoop-hdfs-jdkUbuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6464/1/artifact/out/branch-compile-hadoop-hdfs-project_hadoop-hdfs-jdkUbuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04.txt)
 |  hadoop-hdfs in trunk failed with JDK 
Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04.  |
   | -1 :x: |  compile  |   0m 21s | 
[/branch-compile-hadoop-hdfs-project_hadoop-hdfs-jdkPrivateBuild-1.8.0_392-8u392-ga-1~20.04-b08.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6464/1/artifact/out/branch-compile-hadoop-hdfs-project_hadoop-hdfs-jdkPrivateBuild-1.8.0_392-8u392-ga-1~20.04-b08.txt)
 |  hadoop-hdfs in trunk failed with JDK Private 
Build-1.8.0_392-8u392-ga-1~20.04-b08.  |
   | -0 :warning: |  checkstyle  |   0m 20s | 
[/buildtool-branch-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6464/1/artifact/out/buildtool-branch-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt)
 |  The patch fails to run checkstyle in hadoop-hdfs  |
   | -1 :x: |  mvnsite  |   0m 21s | 
[/branch-mvnsite-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6464/1/artifact/out/branch-mvnsite-hadoop-hdfs-project_hadoop-hdfs.txt)
 |  hadoop-hdfs in trunk failed.  |
   | -1 :x: |  javadoc  |   0m 27s | 
[/branch-javadoc-hadoop-hdfs-project_hadoop-hdfs-jdkUbuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6464/1/artifact/out/branch-javadoc-hadoop-hdfs-project_hadoop-hdfs-jdkUbuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04.txt)
 |  hadoop-hdfs in trunk failed with JDK 
Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04.  |
   | -1 :x: |  javadoc  |   3m 21s | 
[/branch-javadoc-hadoop-hdfs-project_hadoop-hdfs-jdkPrivateBuild-1.8.0_392-8u392-ga-1~20.04-b08.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6464/1/artifact/out/branch-javadoc-hadoop-hdfs-project_hadoop-hdfs-jdkPrivateBuild-1.8.0_392-8u392-ga-1~20.04-b08.txt)
 |  hadoop-hdfs in trunk failed with JDK Private 
Build-1.8.0_392-8u392-ga-1~20.04-b08.  |
   | -1 :x: |  spotbugs  |   0m 21s | 
[/branch-spotbugs-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6464/1/artifact/out/branch-spotbugs-hadoop-hdfs-project_hadoop-hdfs.txt)
 |  hadoop-hdfs in trunk failed.  |
   | +1 :green_heart: |  shadedclient  |   5m 33s |  |  branch has no errors 
when building and testing our client artifacts.  |
    _ Patch Compile Tests _ |
   | -1 :x: |  mvninstall  |   0m 22s | 
[/patch-mvninstall-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6464/1/artifact/out/patch-mvninstall-hadoop-hdfs-project_hadoop-hdfs.txt)
 |  hadoop-hdfs in the patch failed.  |
   | -1 :x: |  compile  |   0m 20s | 
[/patch-compile-hadoop-hdfs-project_hadoop-hdfs-jdkUbuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6464/1/artifact/out/patch-compile-hadoop-hdfs-project_hadoop-hdfs-jdkUbuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04.txt)
 |  hadoop-hdfs in the patch failed with JDK 
Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04.  |
   | -1 :x: |  javac  |   0m 20s | 
[/patch-compile-hadoop-hdfs-project_hadoop-hdfs-jdkUbuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6464/1/artifact/out/patch-compile-hadoop-hdfs-project_hadoop-hdfs-jdkUbuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04.txt)
 |  hadoop-hdfs in the patch failed with JDK 

[jira] [Updated] (HDFS-17342) Fix DataNode may invalidates normal block causing missing block

2024-01-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HDFS-17342:
--
Labels: pull-request-available  (was: )

> Fix DataNode may invalidates normal block causing missing block
> ---
>
> Key: HDFS-17342
> URL: https://issues.apache.org/jira/browse/HDFS-17342
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Haiyang Hu
>Assignee: Haiyang Hu
>Priority: Major
>  Labels: pull-request-available
>
> When users read a file that is being appended, occasional exceptions may 
> occur, such as org.apache.hadoop.hdfs.BlockMissingException: Could not 
> obtain block: xxx.
> This can happen if one thread is reading the block while the writer thread 
> is finalizing it simultaneously.
> *Root cause:*
> # The reader thread obtains an RBW replica from the VolumeMap, such as 
> blk_xxx_xxx[RBW], whose data file should be in /XXX/rbw/blk_xxx.
> # Simultaneously, the writer thread finalizes this block, moving it from 
> the RBW directory to the FINALIZE directory: the data file is moved from 
> /XXX/rbw/block_xxx to /XXX/finalize/block_xxx.
> # The reader thread attempts to open the data input stream but encounters a 
> FileNotFoundException because the data file /XXX/rbw/blk_xxx or meta file 
> /XXX/rbw/blk_xxx_xxx doesn't exist at this moment.
> # The reader thread treats this block as corrupt and removes the replica 
> from the volume map, and the DataNode reports the deleted block to the 
> NameNode.
> # The NameNode removes this replica of the block.
> # If the current file replication is 1, this file will cause a missing block 
> issue until this DataNode executes the DirectoryScanner again.
> As described above, the FileNotFoundException encountered by the reader 
> thread is expected, because the file has merely been moved.
> So we need to add a double check to the invalidateMissingBlock logic that 
> verifies whether the data file or meta file still exists, to avoid similar 
> cases.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-17342) Fix DataNode may invalidates normal block causing missing block

2024-01-18 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17808090#comment-17808090
 ] 

ASF GitHub Bot commented on HDFS-17342:
---

haiyang1987 opened a new pull request, #6464:
URL: https://github.com/apache/hadoop/pull/6464

   ### Description of PR
   https://issues.apache.org/jira/browse/HDFS-17342
   
   When users read a file that is being appended, occasional exceptions may 
occur, such as org.apache.hadoop.hdfs.BlockMissingException: Could not obtain 
block: xxx.
   
   This can happen if one thread is reading the block while the writer thread 
is finalizing it simultaneously.
   
   **Root cause:**
   
   1. The reader thread obtains an RBW replica from the VolumeMap, such as 
blk_xxx_xxx[RBW], whose data file should be in /XXX/rbw/blk_xxx.
   2. Simultaneously, the writer thread finalizes this block, moving it from 
the RBW directory to the FINALIZE directory: the data file is moved from 
/XXX/rbw/block_xxx to /XXX/finalize/block_xxx.
   3. The reader thread attempts to open the data input stream but encounters 
a FileNotFoundException because the data file /XXX/rbw/blk_xxx or meta file 
/XXX/rbw/blk_xxx_xxx doesn't exist at this moment.
   4. The reader thread treats this block as corrupt and removes the replica 
from the volume map, and the DataNode reports the deleted block to the 
NameNode.
   5. The NameNode removes this replica of the block.
   6. If the current file replication is 1, this file will cause a missing 
block issue until this DataNode executes the DirectoryScanner again.
   
   As described above, the FileNotFoundException encountered by the reader 
thread is expected, because the file has merely been moved.
   So we need to add a double check to the invalidateMissingBlock logic that 
verifies whether the data file or meta file still exists, to avoid similar 
cases. A minimal sketch of such a check follows.
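   
   As a hedged illustration of that double check (the class, method, and 
parameter names below are assumptions made for this sketch, not the actual 
patch):
   
{code:java}
import java.io.File;

// Minimal, self-contained sketch: re-check the disk before invalidating a
// replica after a read-side FileNotFoundException. A concurrent finalize only
// moves the files from the rbw directory to the finalized directory, so if
// both files are still present somewhere, the replica is healthy and must not
// be invalidated.
final class InvalidateMissingBlockCheck {

  /** Returns true only when the block is truly gone from both locations. */
  static boolean shouldInvalidate(File rbwDir, File finalizedDir,
                                  String dataFileName, String metaFileName) {
    boolean dataExists = new File(rbwDir, dataFileName).exists()
        || new File(finalizedDir, dataFileName).exists();
    boolean metaExists = new File(rbwDir, metaFileName).exists()
        || new File(finalizedDir, metaFileName).exists();
    // Invalidate only if the files are missing from both directories.
    return !dataExists && !metaExists;
  }
}
{code}
   
   With a check like this, a replica whose files were merely moved by a 
concurrent finalize is left intact instead of being removed from the volume 
map and reported as deleted.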
   




> Fix DataNode may invalidates normal block causing missing block
> ---
>
> Key: HDFS-17342
> URL: https://issues.apache.org/jira/browse/HDFS-17342
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Haiyang Hu
>Assignee: Haiyang Hu
>Priority: Major
>
> When users read a file that is being appended, occasional exceptions may 
> occur, such as org.apache.hadoop.hdfs.BlockMissingException: Could not 
> obtain block: xxx.
> This can happen if one thread is reading the block while the writer thread 
> is finalizing it simultaneously.
> *Root cause:*
> # The reader thread obtains an RBW replica from the VolumeMap, such as 
> blk_xxx_xxx[RBW], whose data file should be in /XXX/rbw/blk_xxx.
> # Simultaneously, the writer thread finalizes this block, moving it from 
> the RBW directory to the FINALIZE directory: the data file is moved from 
> /XXX/rbw/block_xxx to /XXX/finalize/block_xxx.
> # The reader thread attempts to open the data input stream but encounters a 
> FileNotFoundException because the data file /XXX/rbw/blk_xxx or meta file 
> /XXX/rbw/blk_xxx_xxx doesn't exist at this moment.
> # The reader thread treats this block as corrupt and removes the replica 
> from the volume map, and the DataNode reports the deleted block to the 
> NameNode.
> # The NameNode removes this replica of the block.
> # If the current file replication is 1, this file will cause a missing block 
> issue until this DataNode executes the DirectoryScanner again.
> As described above, the FileNotFoundException encountered by the reader 
> thread is expected, because the file has merely been moved.
> So we need to add a double check to the invalidateMissingBlock logic that 
> verifies whether the data file or meta file still exists, to avoid similar 
> cases.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-17334) FSEditLogAsync#enqueueEdit does not synchronized this before invoke wait method

2024-01-18 Thread farmmamba (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

farmmamba resolved HDFS-17334.
--
Resolution: Not A Problem

> FSEditLogAsync#enqueueEdit does not synchronized this before invoke wait 
> method
> ---
>
> Key: HDFS-17334
> URL: https://issues.apache.org/jira/browse/HDFS-17334
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 3.3.6
>Reporter: farmmamba
>Assignee: farmmamba
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.5.0
>
>
> In the method FSEditLogAsync#enqueueEdit, the following code exists:
> {code:java}
> if (Thread.holdsLock(this)) {
>           // if queue is full, synchronized caller must immediately relinquish
>           // the monitor before re-offering to avoid deadlock with sync thread
>           // which needs the monitor to write transactions.
>           int permits = overflowMutex.drainPermits();
>           try {
>             do {
>               this.wait(1000); // will be notified by next logSync.
>             } while (!editPendingQ.offer(edit));
>           } finally {
>             overflowMutex.release(permits);
>           }
>         }  {code}
> It may invoke this.wait(1000) without holding this object's monitor.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-17334) FSEditLogAsync#enqueueEdit does not synchronized this before invoke wait method

2024-01-18 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17808077#comment-17808077
 ] 

ASF GitHub Bot commented on HDFS-17334:
---

hfutatzhanghb closed pull request #6434: HDFS-17334. FSEditLogAsync#enqueueEdit 
does not synchronized this before invoke wait method.
URL: https://github.com/apache/hadoop/pull/6434




> FSEditLogAsync#enqueueEdit does not synchronized this before invoke wait 
> method
> ---
>
> Key: HDFS-17334
> URL: https://issues.apache.org/jira/browse/HDFS-17334
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 3.3.6
>Reporter: farmmamba
>Assignee: farmmamba
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.5.0
>
>
> In the method FSEditLogAsync#enqueueEdit, the following code exists:
> {code:java}
> if (Thread.holdsLock(this)) {
>           // if queue is full, synchronized caller must immediately relinquish
>           // the monitor before re-offering to avoid deadlock with sync thread
>           // which needs the monitor to write transactions.
>           int permits = overflowMutex.drainPermits();
>           try {
>             do {
>               this.wait(1000); // will be notified by next logSync.
>             } while (!editPendingQ.offer(edit));
>           } finally {
>             overflowMutex.release(permits);
>           }
>         }  {code}
> It may invoke this.wait(1000) without holding this object's monitor.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-17334) FSEditLogAsync#enqueueEdit does not synchronized this before invoke wait method

2024-01-18 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17808076#comment-17808076
 ] 

ASF GitHub Bot commented on HDFS-17334:
---

hfutatzhanghb commented on PR #6434:
URL: https://github.com/apache/hadoop/pull/6434#issuecomment-1898012100

   > > > Line 211 has already ensured that we have the monitor for this object:
   > > > 
https://github.com/apache/hadoop/blob/ba6ada73acc2bce560878272c543534c21c76f22/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSEditLogAsync.java#L211-L223
   > > > 
   > > > So, I think the description in this PR is not a problem. What's your 
opinion? @hfutatzhanghb
   > > 
   > > 
   > > @zhangshuyan0 Sir, `this.wait(1000);` is in a do-while loop. When we 
invoke `this.wait(1000)` the first time, it releases the object monitor. But 
in an extreme situation, it would throw an exception when `this.wait(1000)` is 
invoked the second time, because the current thread would not hold the object 
monitor. Waiting for your response~
   > 
   > Let's see the [Java 
doc](https://docs.oracle.com/javase/8/docs/api/java/lang/Object.html#wait-long-):
   > 
   > > Thus, on return from the wait method, the synchronization state of the 
object and of thread T is exactly as it was when the wait method was invoked.
   > 
   > Therefore, after `this.wait(1000)` returns the first time, it holds the 
monitor again. I think no exception will be thrown here. By the way, in this 
[Java 
doc](https://docs.oracle.com/javase/8/docs/api/java/lang/Object.html#wait-long-),
 `synchronized -> while loop` is shown as the recommended usage. Looking 
forward to your response.
   
   Sir, thanks a lot for your explanations here. I will close this PR later. 
Thanks again.




> FSEditLogAsync#enqueueEdit does not synchronized this before invoke wait 
> method
> ---
>
> Key: HDFS-17334
> URL: https://issues.apache.org/jira/browse/HDFS-17334
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 3.3.6
>Reporter: farmmamba
>Assignee: farmmamba
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.5.0
>
>
> In the method FSEditLogAsync#enqueueEdit, the following code exists:
> {code:java}
> if (Thread.holdsLock(this)) {
>           // if queue is full, synchronized caller must immediately relinquish
>           // the monitor before re-offering to avoid deadlock with sync thread
>           // which needs the monitor to write transactions.
>           int permits = overflowMutex.drainPermits();
>           try {
>             do {
>               this.wait(1000); // will be notified by next logSync.
>             } while (!editPendingQ.offer(edit));
>           } finally {
>             overflowMutex.release(permits);
>           }
>         }  {code}
> It may invoke this.wait(1000) without holding this object's monitor.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-17311) RBF: ConnectionManager creatorQueue should offer a pool that is not already in creatorQueue.

2024-01-18 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17808074#comment-17808074
 ] 

ASF GitHub Bot commented on HDFS-17311:
---

LiuGuH commented on PR #6392:
URL: https://github.com/apache/hadoop/pull/6392#issuecomment-1898009075

   > LGTM @slfan1989 any further comments?
   
   @slfan1989, hello sir, any further comments? Thanks. 




> RBF: ConnectionManager creatorQueue should offer a pool that is not already 
> in creatorQueue.
> 
>
> Key: HDFS-17311
> URL: https://issues.apache.org/jira/browse/HDFS-17311
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: liuguanghua
>Assignee: liuguanghua
>Priority: Major
>  Labels: pull-request-available
>
> In the Router, we found the below log:
>  
> 2023-12-29 15:18:54,799 ERROR 
> org.apache.hadoop.hdfs.server.federation.router.ConnectionManager: Cannot add 
> more than 2048 connections at the same time
>  
> The log indicates that ConnectionManager.creatorQueue is full at a certain 
> point. But my cluster does not have so many users that it could reach up to 
> 2048 pairs of .
> This may be due to the following reasons:
> # ConnectionManager.creatorQueue is a queue that a ConnectionPool is offered 
> to when it does not have enough ConnectionContexts.
> # The ConnectionCreator thread consumes from creatorQueue and creates more 
> ConnectionContexts for a ConnectionPool.
> # Clients may concurrently invoke ConnectionManager.getConnection() for the 
> same user, and this may lead to the same ConnectionPool being added to 
> ConnectionManager.creatorQueue many times.
> # When creatorQueue is full, a new ConnectionPool cannot be added 
> successfully and this error is logged. This may mean that a genuinely new 
> ConnectionPool cannot produce more ConnectionContexts for a new user.
> So this PR tries to ensure that creatorQueue does not admit the same 
> ConnectionPool more than once; see the sketch after this description.
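
A minimal sketch of the deduplication idea above (the element type and names 
are illustrative assumptions; the real code queues ConnectionPool objects 
inside ConnectionManager):

{code:java}
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

final class CreatorQueueSketch {
  // Bounded like the real creatorQueue (2048, per the error message above).
  private final BlockingQueue<String> creatorQueue =
      new ArrayBlockingQueue<>(2048);

  /**
   * Offers a pool key only if an equal key is not already queued, so that
   * concurrent getConnection() calls for the same user cannot fill the
   * bounded queue with duplicates. Returns false if the key was already
   * queued or the queue is full.
   */
  boolean offerIfAbsent(String poolKey) {
    // contains() and offer() must act atomically with respect to other
    // callers; a BlockingQueue alone gives no atomicity for this
    // check-then-act sequence, hence the explicit synchronization.
    synchronized (creatorQueue) {
      return !creatorQueue.contains(poolKey) && creatorQueue.offer(poolKey);
    }
  }
}
{code}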



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-17334) FSEditLogAsync#enqueueEdit does not synchronized this before invoke wait method

2024-01-18 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17808062#comment-17808062
 ] 

ASF GitHub Bot commented on HDFS-17334:
---

zhangshuyan0 commented on PR #6434:
URL: https://github.com/apache/hadoop/pull/6434#issuecomment-1897981697

   > > Line 211 has already ensured that we have the monitor for this object:
   > > 
https://github.com/apache/hadoop/blob/ba6ada73acc2bce560878272c543534c21c76f22/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSEditLogAsync.java#L211-L223
   > > 
   > > So, I think the description in this PR is not a problem. What's your 
opinion? @hfutatzhanghb
   > 
   > @zhangshuyan0 Sir, `this.wait(1000);` is in a do-while loop. When we 
invoke `this.wait(1000)` the first time, it releases the object monitor. But 
in an extreme situation, it would throw an exception when `this.wait(1000)` is 
invoked the second time, because the current thread would not hold the object 
monitor. Waiting for your response~
   
   Let's see the [Java 
doc](https://docs.oracle.com/javase/8/docs/api/java/lang/Object.html#wait-long-):
   
   > Thus, on return from the wait method, the synchronization state of the 
object and of thread T is exactly as it was when the wait method was invoked.
   
   Therefore, after `this.wait(1000)` returns the first time, it holds the 
monitor again. I think no exception will be thrown here. By the way, in this 
[Java 
doc](https://docs.oracle.com/javase/8/docs/api/java/lang/Object.html#wait-long-),
 `synchronized -> while loop` is shown as the recommended usage. Looking 
forward to your response.
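   
   For readers following this thread, a tiny runnable demo of the contract 
cited above (plain illustration of the JDK semantics, not code from 
FSEditLogAsync): wait() releases the monitor while the thread waits, but 
always re-acquires it before returning, so repeated this.wait(...) calls 
inside one synchronized block cannot throw IllegalMonitorStateException.
   
{code:java}
public final class WaitMonitorDemo {
  public static void main(String[] args) throws InterruptedException {
    final Object lock = new Object();
    synchronized (lock) {
      for (int i = 0; i < 3; i++) {
        // True: we are inside the synchronized block, holding the monitor.
        System.out.println("holds lock before wait: " + Thread.holdsLock(lock));
        lock.wait(100); // releases the monitor, times out, re-acquires it
        // Still true: wait() returned with the monitor held again, so the
        // next iteration may safely call wait() once more.
        System.out.println("holds lock after wait:  " + Thread.holdsLock(lock));
      }
    }
  }
}
{code}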
   




> FSEditLogAsync#enqueueEdit does not synchronized this before invoke wait 
> method
> ---
>
> Key: HDFS-17334
> URL: https://issues.apache.org/jira/browse/HDFS-17334
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 3.3.6
>Reporter: farmmamba
>Assignee: farmmamba
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.5.0
>
>
> In the method FSEditLogAsync#enqueueEdit, the following code exists:
> {code:java}
> if (Thread.holdsLock(this)) {
>           // if queue is full, synchronized caller must immediately relinquish
>           // the monitor before re-offering to avoid deadlock with sync thread
>           // which needs the monitor to write transactions.
>           int permits = overflowMutex.drainPermits();
>           try {
>             do {
>               this.wait(1000); // will be notified by next logSync.
>             } while (!editPendingQ.offer(edit));
>           } finally {
>             overflowMutex.release(permits);
>           }
>         }  {code}
> It may invoke this.wait(1000) without holding this object's monitor.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org