[jira] [Updated] (HDFS-17342) Fix DataNode may invalidates normal block causing missing block
[ https://issues.apache.org/jira/browse/HDFS-17342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haiyang Hu updated HDFS-17342:
--
Component/s: datanode

> Fix DataNode may invalidates normal block causing missing block
> ---
>
> Key: HDFS-17342
> URL: https://issues.apache.org/jira/browse/HDFS-17342
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: datanode
> Reporter: Haiyang Hu
> Assignee: Haiyang Hu
> Priority: Major
> Labels: pull-request-available
>
> When users read an append file, occasional exceptions may occur, such as
> org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block: xxx.
> This can happen if one thread is reading the block while the writer thread is
> finalizing it simultaneously.
> *Root cause:*
> # The reader thread obtains an RBW replica from the volume map, such as
> blk_xxx_xxx[RBW], whose data file should be in /XXX/rbw/blk_xxx.
> # Simultaneously, the writer thread finalizes this block, moving it from the
> RBW directory to the finalized directory: the data file is moved from
> /XXX/rbw/block_xxx to /XXX/finalize/block_xxx.
> # The reader thread attempts to open the data input stream but encounters a
> FileNotFoundException, because the data file /XXX/rbw/blk_xxx or the meta
> file /XXX/rbw/blk_xxx_xxx no longer exists at this moment.
> # The reader thread treats this block as corrupt, removes the replica from
> the volume map, and the DataNode reports the deleted block to the NameNode.
> # The NameNode removes this replica for the block.
> # If the file's current replication is 1, this causes a missing block until
> this DataNode runs the DirectoryScanner again.
> As described above, the FileNotFoundException encountered by the reader
> thread is expected, because the file has been moved.
> So we need to add a double check to the invalidateMissingBlock logic that
> verifies whether the data file or meta file exists, to avoid similar cases.
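The proposed double check could look roughly like the following. This is a minimal sketch with illustrative class and method names, not the actual HDFS-17342 patch: before treating a FileNotFoundException as corruption, re-check whether the replica's files are really gone.

```java
import java.io.File;

// Illustrative sketch of the double check described above (names are
// hypothetical; the real logic lives in the DataNode's
// invalidateMissingBlock path).
public class InvalidateCheck {

    // Invalidate only if neither the data file nor the meta file exists.
    // If either is still present, the FileNotFoundException was most likely
    // caused by a concurrent finalize that moved the files, so the replica
    // is healthy and must not be reported as deleted to the NameNode.
    static boolean shouldInvalidate(File dataFile, File metaFile) {
        return !dataFile.exists() && !metaFile.exists();
    }

    public static void main(String[] args) {
        File data = new File("/no/such/dir/blk_1234");
        File meta = new File("/no/such/dir/blk_1234_1001.meta");
        // Both paths absent: invalidation is justified.
        System.out.println(shouldInvalidate(data, meta));
    }
}
```

Under this check, a reader that races with finalize would simply fail its local open and retry, instead of triggering a block invalidation.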
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-17342) Fix DataNode may invalidates normal block causing missing block
[ https://issues.apache.org/jira/browse/HDFS-17342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haiyang Hu updated HDFS-17342:
--
Fix Version/s: 3.5.0

> Fix DataNode may invalidates normal block causing missing block
> ---
>
> Key: HDFS-17342
> URL: https://issues.apache.org/jira/browse/HDFS-17342
> Fix For: 3.5.0
[jira] [Commented] (HDFS-17342) Fix DataNode may invalidates normal block causing missing block
[ https://issues.apache.org/jira/browse/HDFS-17342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17808502#comment-17808502 ] ASF GitHub Bot commented on HDFS-17342:
---
hadoop-yetus commented on PR #6464:
URL: https://github.com/apache/hadoop/pull/6464#issuecomment-1899876542

:broken_heart: **-1 overall**

| Vote | Subsystem | Runtime | Logfile | Comment |
|:----:|----------:|--------:|:-------:|:-------:|
| +0 :ok: | reexec | 0m 21s | | Docker mode activated. |
|| _ Prechecks _ |
| +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. |
| +0 :ok: | codespell | 0m 1s | | codespell was not available. |
| +0 :ok: | detsecrets | 0m 1s | | detect-secrets was not available. |
| +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. |
| +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 1 new or modified test files. |
|| _ trunk Compile Tests _ |
| +1 :green_heart: | mvninstall | 32m 18s | | trunk passed |
| +1 :green_heart: | compile | 0m 41s | | trunk passed with JDK Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04 |
| +1 :green_heart: | compile | 0m 37s | | trunk passed with JDK Private Build-1.8.0_392-8u392-ga-1~20.04-b08 |
| +1 :green_heart: | checkstyle | 1m 9s | | trunk passed |
| +1 :green_heart: | mvnsite | 0m 39s | | trunk passed |
| +1 :green_heart: | javadoc | 0m 39s | | trunk passed with JDK Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04 |
| +1 :green_heart: | javadoc | 1m 1s | | trunk passed with JDK Private Build-1.8.0_392-8u392-ga-1~20.04-b08 |
| +1 :green_heart: | spotbugs | 1m 47s | | trunk passed |
| +1 :green_heart: | shadedclient | 20m 30s | | branch has no errors when building and testing our client artifacts. |
|| _ Patch Compile Tests _ |
| +1 :green_heart: | mvninstall | 0m 35s | | the patch passed |
| +1 :green_heart: | compile | 0m 39s | | the patch passed with JDK Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04 |
| +1 :green_heart: | javac | 0m 39s | | the patch passed |
| +1 :green_heart: | compile | 0m 33s | | the patch passed with JDK Private Build-1.8.0_392-8u392-ga-1~20.04-b08 |
| +1 :green_heart: | javac | 0m 33s | | the patch passed |
| +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. |
| +1 :green_heart: | checkstyle | 0m 29s | | the patch passed |
| +1 :green_heart: | mvnsite | 0m 35s | | the patch passed |
| +1 :green_heart: | javadoc | 0m 32s | | the patch passed with JDK Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04 |
| +1 :green_heart: | javadoc | 0m 55s | | the patch passed with JDK Private Build-1.8.0_392-8u392-ga-1~20.04-b08 |
| +1 :green_heart: | spotbugs | 1m 44s | | the patch passed |
| +1 :green_heart: | shadedclient | 20m 35s | | patch has no errors when building and testing our client artifacts. |
|| _ Other Tests _ |
| -1 :x: | unit | 196m 24s | [/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6464/4/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt) | hadoop-hdfs in the patch failed. |
| +1 :green_heart: | asflicense | 0m 27s | | The patch does not generate ASF License warnings. |
| | | 283m 41s | | |

| Reason | Tests |
|-------:|:------|
| Failed junit tests | hadoop.hdfs.server.namenode.TestFileTruncate |
| | hadoop.hdfs.server.datanode.TestDirectoryScanner |

| Subsystem | Report/Notes |
|----------:|:-------------|
| Docker | ClientAPI=1.43 ServerAPI=1.43 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6464/4/artifact/out/Dockerfile |
| GITHUB PR | https://github.com/apache/hadoop/pull/6464 |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
| uname | Linux c7f45b3093ae 5.15.0-88-generic #98-Ubuntu SMP Mon Oct 2 15:18:56 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | dev-support/bin/hadoop.sh |
| git revision | trunk / 303ea2116d8b3373a82a310bae480b33aedc15e0 |
| Default Java | Private Build-1.8.0_392-8u392-ga-1~20.04-b08 |
| Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_392-8u392-ga-1~20.04-b08 |
| Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6464/4/testReport/ |
| Max. process+thread count | 3961 (vs. ulimit of 5500) |
| modules | C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs |
| Console output |
[jira] [Commented] (HDFS-17293) First packet data + checksum size will be set to 516 bytes when writing to a new block.
[ https://issues.apache.org/jira/browse/HDFS-17293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17808496#comment-17808496 ] ASF GitHub Bot commented on HDFS-17293:
---
hfutatzhanghb commented on code in PR #6368:
URL: https://github.com/apache/hadoop/pull/6368#discussion_r1458465700

## hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestDFSOutputStream.java:
## @@ -184,6 +186,40 @@ public void testPreventOverflow() throws IOException, NoSuchFieldException,
     runAdjustChunkBoundary(configuredWritePacketSize, finalWritePacketSize);
   }

+  @Test(timeout=6)
+  public void testFirstPacketSizeInNewBlocks() throws IOException {
+    final long blockSize = 1L * 1024 * 1024;
+    final int numDataNodes = 3;
+    final Configuration dfsConf = new Configuration();
+    dfsConf.setLong(DFS_BLOCK_SIZE_KEY, blockSize);
+    MiniDFSCluster dfsCluster = null;
+    dfsCluster = new MiniDFSCluster.Builder(dfsConf).numDataNodes(numDataNodes).build();
+    dfsCluster.waitActive();
+
+    DistributedFileSystem fs = dfsCluster.getFileSystem();
+    Path fileName = new Path("/testfile.dat");
+    FSDataOutputStream fos = fs.create(fileName);
+    DataChecksum crc32c = DataChecksum.newDataChecksum(DataChecksum.Type.CRC32C, 512);
+
+    long loop = 0;
+    Random r = new Random();
+    byte[] buf = new byte[1 * 1024 * 1024];
+    r.nextBytes(buf);
+    fos.write(buf);
+    fos.hflush();
+
+    while (loop < 20) {
+      r.nextBytes(buf);
+      fos.write(buf);
+      fos.hflush();
+      loop++;
+      Assert.assertNotEquals(crc32c.getBytesPerChecksum() + crc32c.getChecksumSize(),

Review Comment:
Sir, thanks for this valuable suggestion. Will fix it soon.

> First packet data + checksum size will be set to 516 bytes when writing to a
> new block.
> ---
>
> Key: HDFS-17293
> URL: https://issues.apache.org/jira/browse/HDFS-17293
> Project: Hadoop HDFS
> Issue Type: Improvement
> Affects Versions: 3.3.6
> Reporter: farmmamba
> Assignee: farmmamba
> Priority: Major
> Labels: pull-request-available
>
> First packet size will be set to 516 bytes when writing to a new block.
> In the method computePacketChunkSize, the parameters psize and csize would be
> (0, 512) when writing to a new block. It would be better to use
> writePacketSize.
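The 516-byte figure follows from the chunk layout: 512 data bytes per checksum plus a 4-byte CRC32C checksum. A simplified model of the packet-size computation (illustrative names and a deliberately reduced formula, not the actual computePacketChunkSize code) shows why psize = 0 yields a single 516-byte chunk per packet:

```java
// Simplified model of how the first packet's body size falls out of the
// (psize, csize) parameters described in the issue. Names are illustrative.
public class FirstPacketSize {
    static final int CHECKSUM_SIZE = 4;       // CRC32C checksum is 4 bytes
    static final int BYTES_PER_CHECKSUM = 512; // default bytes.per.checksum

    // How many whole chunks (data + checksum) fit into a packet body of
    // psize bytes; at least one chunk is always sent.
    static int firstPacketBodySize(int psize) {
        int chunkSize = BYTES_PER_CHECKSUM + CHECKSUM_SIZE; // 516
        int chunksPerPacket = Math.max(psize / chunkSize, 1);
        return chunksPerPacket * chunkSize;
    }

    public static void main(String[] args) {
        // psize = 0 (the reported behavior on a new block): one 516-byte chunk.
        System.out.println(firstPacketBodySize(0));         // 516
        // Using the default writePacketSize (64 KiB) instead:
        System.out.println(firstPacketBodySize(64 * 1024)); // 65532
    }
}
```

This illustrates the issue's point: seeding the computation with writePacketSize instead of 0 lets the first packet of a new block carry many chunks rather than one.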
[jira] [Commented] (HDFS-17293) First packet data + checksum size will be set to 516 bytes when writing to a new block.
[ https://issues.apache.org/jira/browse/HDFS-17293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17808488#comment-17808488 ] ASF GitHub Bot commented on HDFS-17293:
---
hfutatzhanghb commented on code in PR #6368:
URL: https://github.com/apache/hadoop/pull/6368#discussion_r1458453315

## hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestDFSOutputStream.java:
## @@ -184,6 +186,40 @@
+    byte[] buf = new byte[1 * 1024 * 1024];

Review Comment:
Very nice suggestion, thanks a lot sir. I will fix them later.

> First packet data + checksum size will be set to 516 bytes when writing to a
> new block.
> ---
>
> Key: HDFS-17293
> URL: https://issues.apache.org/jira/browse/HDFS-17293
[jira] [Commented] (HDFS-17311) RBF: ConnectionManager creatorQueue should offer a pool that is not already in creatorQueue.
[ https://issues.apache.org/jira/browse/HDFS-17311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17808478#comment-17808478 ] farmmamba commented on HDFS-17311:
--
Can use "git commit --allow-empty"

> RBF: ConnectionManager creatorQueue should offer a pool that is not already
> in creatorQueue.
> ---
>
> Key: HDFS-17311
> URL: https://issues.apache.org/jira/browse/HDFS-17311
> Project: Hadoop HDFS
> Issue Type: Improvement
> Reporter: liuguanghua
> Assignee: liuguanghua
> Priority: Major
> Labels: pull-request-available
>
> In the Router, we found the log below:
> 2023-12-29 15:18:54,799 ERROR org.apache.hadoop.hdfs.server.federation.router.ConnectionManager: Cannot add more than 2048 connections at the same time
> The log indicates that ConnectionManager.creatorQueue is full at a certain
> point, but the cluster does not have so many users that it could reach 2048
> pairs.
> This may be due to the following reasons:
> # ConnectionManager.creatorQueue is a queue to which a ConnectionPool is
> offered when it does not have enough ConnectionContexts.
> # The ConnectionCreator thread consumes from creatorQueue and creates more
> ConnectionContexts for a ConnectionPool.
> # Clients concurrently invoke ConnectionManager.getConnection() for the same
> user, which may add the same ConnectionPool to creatorQueue many times.
> # When creatorQueue is full, a new ConnectionPool cannot be added and this
> error is logged. This may prevent a genuinely new ConnectionPool from
> producing more ConnectionContexts for a new user.
> So this PR ensures that creatorQueue does not admit the same ConnectionPool
> twice.
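The fix described in the last point can be sketched as follows. This is a simplified model, not the actual ConnectionManager code: the pool is represented by a String key, and offerIfAbsent is an illustrative name.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Illustrative sketch: offer a pool to creatorQueue only if it is not
// already queued, so concurrent getConnection() callers for the same user
// cannot flood the bounded queue with duplicates.
public class CreatorQueueSketch {
    private final BlockingQueue<String> creatorQueue = new ArrayBlockingQueue<>(2048);

    // Synchronized so the contains-then-offer pair is atomic with respect
    // to concurrent callers.
    synchronized boolean offerIfAbsent(String pool) {
        if (creatorQueue.contains(pool)) {
            return false; // already pending creation; skip the duplicate
        }
        return creatorQueue.offer(pool);
    }

    public static void main(String[] args) {
        CreatorQueueSketch q = new CreatorQueueSketch();
        System.out.println(q.offerIfAbsent("pool-user1-ns1")); // enqueued
        System.out.println(q.offerIfAbsent("pool-user1-ns1")); // rejected duplicate
    }
}
```

With duplicates excluded, a full creatorQueue now implies 2048 genuinely distinct pools awaiting creation, so a new pool for a new user is no longer starved by repeats of an existing one.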
[jira] [Commented] (HDFS-17311) RBF: ConnectionManager creatorQueue should offer a pool that is not already in creatorQueue.
[ https://issues.apache.org/jira/browse/HDFS-17311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17808474#comment-17808474 ] ASF GitHub Bot commented on HDFS-17311:
---
LiuGuH commented on PR #6392:
URL: https://github.com/apache/hadoop/pull/6392#issuecomment-1899821409

> @LiuGuH Thanks for the contribution! Can we trigger compilation again?

Thanks for the review. Compilation is now triggered; I triggered it with "git commit --amend && git push -f". Is there any other way to trigger compilation? Thanks

> RBF: ConnectionManager creatorQueue should offer a pool that is not already
> in creatorQueue.
> ---
>
> Key: HDFS-17311
> URL: https://issues.apache.org/jira/browse/HDFS-17311
[jira] [Commented] (HDFS-17311) RBF: ConnectionManager creatorQueue should offer a pool that is not already in creatorQueue.
[ https://issues.apache.org/jira/browse/HDFS-17311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17808467#comment-17808467 ] ASF GitHub Bot commented on HDFS-17311:
---
slfan1989 commented on PR #6392:
URL: https://github.com/apache/hadoop/pull/6392#issuecomment-1899809792

> LGTM @slfan1989 any further comments?

@goiri Thanks for reviewing the code! LGTM +1.

> RBF: ConnectionManager creatorQueue should offer a pool that is not already
> in creatorQueue.
> ---
>
> Key: HDFS-17311
> URL: https://issues.apache.org/jira/browse/HDFS-17311
[jira] [Commented] (HDFS-17311) RBF: ConnectionManager creatorQueue should offer a pool that is not already in creatorQueue.
[ https://issues.apache.org/jira/browse/HDFS-17311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17808466#comment-17808466 ] ASF GitHub Bot commented on HDFS-17311:
---
slfan1989 commented on PR #6392:
URL: https://github.com/apache/hadoop/pull/6392#issuecomment-1899809388

@LiuGuH Thanks for the contribution! Can we trigger compilation again?

> RBF: ConnectionManager creatorQueue should offer a pool that is not already
> in creatorQueue.
> ---
>
> Key: HDFS-17311
> URL: https://issues.apache.org/jira/browse/HDFS-17311
[jira] [Commented] (HDFS-17332) DFSInputStream: avoid logging stacktrace until when we really need to fail a read request with a MissingBlockException
[ https://issues.apache.org/jira/browse/HDFS-17332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17808454#comment-17808454 ] ASF GitHub Bot commented on HDFS-17332:
---
xinglin commented on PR #6446:
URL: https://github.com/apache/hadoop/pull/6446#issuecomment-1899717945

Thanks @ctrezzo, @li-leyang and @mccormickt12 for reviewing

> DFSInputStream: avoid logging stacktrace until when we really need to fail a
> read request with a MissingBlockException
> ---
>
> Key: HDFS-17332
> URL: https://issues.apache.org/jira/browse/HDFS-17332
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: hdfs
> Reporter: Xing Lin
> Assignee: Xing Lin
> Priority: Minor
> Labels: pull-request-available
>
> In DFSInputStream#actualGetFromOneDataNode(), the exception stack trace is
> sent to dfsClient.LOG whenever we fail on a DN. However, in most cases the
> read request will be served successfully by reading from the next available
> DN. The stack trace in the log has caused multiple Hadoop users at LinkedIn
> to consider this WARN message the root cause/fatal error for their jobs. We
> would like to improve the log message and avoid sending the stack trace to
> dfsClient.LOG when a read succeeds. The stack trace from reading each DN is
> sent to the log only when we really need to fail a read request (when
> chooseDataNode()/refetchLocations() throws a BlockMissingException).
>
> Example stack trace
> {code:java}
> [12]: 23/11/30 23:01:33 WARN hdfs.DFSClient: Connection failure: Failed to connect to 10.150.91.13/10.150.91.13:71 for file //part--95b9909c-zzz-c000.avro for block BP-364971551-DatanodeIP-1448516588954:blk__129864739321:java.net.SocketTimeoutException: 6 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/ip:40492 remote=datanodeIP:71]
> [12]: java.net.SocketTimeoutException: 6 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/localIp:40492 remote=datanodeIP:71]
> [12]: at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
> [12]: at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
> [12]: at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
> [12]: at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:118)
> [12]: at java.io.FilterInputStream.read(FilterInputStream.java:83)
> [12]: at org.apache.hadoop.hdfs.protocolPB.PBHelperClient.vintPrefixed(PBHelperClient.java:458)
> [12]: at org.apache.hadoop.hdfs.client.impl.BlockReaderRemote2.newBlockReader(BlockReaderRemote2.java:412)
> [12]: at org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.getRemoteBlockReader(BlockReaderFactory.java:864)
> [12]: at org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:753)
> [12]: at org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.build(BlockReaderFactory.java:387)
> [12]: at org.apache.hadoop.hdfs.DFSInputStream.getBlockReader(DFSInputStream.java:736)
> [12]: at org.apache.hadoop.hdfs.DFSInputStream.actualGetFromOneDataNode(DFSInputStream.java:1268)
> [12]: at org.apache.hadoop.hdfs.DFSInputStream.fetchBlockByteRange(DFSInputStream.java:1216)
> [12]: at org.apache.hadoop.hdfs.DFSInputStream.pread(DFSInputStream.java:1608)
> [12]: at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:1568)
> [12]: at org.apache.hadoop.fs.FSDataInputStream.read(FSDataInputStream.java:93)
> [12]: at hdfs_metrics_shade.org.apache.hadoop.fs.InstrumentedFSDataInputStream$InstrumentedFilterInputStream.lambda$read$0(InstrumentedFSDataInputStream.java:108)
> [12]: at com.linkedin.hadoop.metrics.fs.PerformanceTrackingFSDataInputStream.process(PerformanceTrackingFSDataInputStream.java:39)
> [12]: at hdfs_metrics_shade.org.apache.hadoop.fs.InstrumentedFSDataInputStream$InstrumentedFilterInputStream.read(InstrumentedFSDataInputStream.java:108)
> [12]: at org.apache.hadoop.fs.FSDataInputStream.read(FSDataInputStream.java:93)
> [12]: at org.apache.hadoop.fs.RetryingInputStream.lambda$read$2(RetryingInputStream.java:153)
> [12]: at org.apache.hadoop.fs.NoOpRetryPolicy.run(NoOpRetryPolicy.java:36)
> [12]: at org.apache.hadoop.fs.RetryingInputStream.read(RetryingInputStream.java:149)
> [12]: at org.apache.hadoop.fs.FSDataInputStream.read(FSDataInputStream.java:93){code}
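The logging strategy described above can be sketched like this. The code is illustrative (it uses java.util.logging and hypothetical names, not the actual DFSInputStream patch): per-DataNode failures get a one-line WARN without the stack trace, and the full trace is emitted only when every replica has failed and the read must be surfaced as an error.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Illustrative sketch of "log the stack trace only on final failure".
public class ReadRetryLogging {
    private static final Logger LOG = Logger.getLogger("DFSClientSketch");

    interface BlockReader { void read(String datanode) throws Exception; }

    static void readWithFailover(String[] datanodes, BlockReader reader) throws Exception {
        Exception last = null;
        for (String dn : datanodes) {
            try {
                reader.read(dn);
                return; // success: earlier failures never put a trace in the log
            } catch (Exception e) {
                last = e;
                // One-line WARN, no stack trace: the read may still succeed.
                LOG.warning("Failed to connect to " + dn + ": " + e.getMessage()
                        + "; will try the next replica");
            }
        }
        // Only now, when the request truly fails, include the stack trace.
        LOG.log(Level.SEVERE, "Could not obtain block from any replica", last);
        throw last;
    }

    public static void main(String[] args) {
        try {
            readWithFailover(new String[] {"dn1", "dn2"},
                dn -> { throw new java.net.SocketTimeoutException("timeout on " + dn); });
        } catch (Exception expected) {
            System.out.println("read failed as expected: " + expected.getMessage());
        }
    }
}
```

The design point is that transient per-replica failures are routine in HDFS reads, so a stack trace at WARN level misleads users into treating a recovered read as a fatal error.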
[jira] [Commented] (HDFS-17302) RBF: ProportionRouterRpcFairnessPolicyController-Sharing and isolation.
[ https://issues.apache.org/jira/browse/HDFS-17302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17808453#comment-17808453 ] ASF GitHub Bot commented on HDFS-17302: --- KeeProMise commented on PR #6380: URL: https://github.com/apache/hadoop/pull/6380#issuecomment-1899693424 > > @huangzhaobo99 do you still have concerns with the approach? > > @goiri No worries anymore, I think the sharing mechanism is really good, and percentage based allocation is easier to use. cc @KeeProMise @goiri @huangzhaobo99 Thanks for your review. If no more comments here, please help merge it, thanks! @goiri > RBF: ProportionRouterRpcFairnessPolicyController-Sharing and isolation. > --- > > Key: HDFS-17302 > URL: https://issues.apache.org/jira/browse/HDFS-17302 > Project: Hadoop HDFS > Issue Type: New Feature > Components: rbf >Reporter: Jian Zhang >Assignee: Jian Zhang >Priority: Major > Labels: pull-request-available > Attachments: HDFS-17302.001.patch, HDFS-17302.002.patch, > HDFS-17302.003.patch > > > h2. Current shortcomings > [HDFS-14090|https://issues.apache.org/jira/browse/HDFS-14090] provides a > StaticRouterRpcFairnessPolicyController to support configuring different > handlers for different ns. Using the StaticRouterRpcFairnessPolicyController > allows the router to isolate different ns, and the ns with a higher load will > not affect the router's access to the ns with a normal load. But the > StaticRouterRpcFairnessPolicyController still falls short in many ways, such > as: > 1. 
*Configuration is inconvenient and error-prone*: When I use > StaticRouterRpcFairnessPolicyController, I first need to know how many > handlers the router has in total, then I have to know how many nameservices > the router currently has, and then carefully calculate how many handlers to > allocate to each ns so that the sum of handlers for all ns will not exceed > the total handlers of the router, and I also need to consider how many > handlers to allocate to each ns to achieve better performance. Therefore, I > need to be very careful when configuring. Even if I configure only one more > handler for a certain ns, the total number is more than the number of > handlers owned by the router, which will also cause the router to fail to > start. At this time, I had to investigate the reason why the router failed to > start. After finding the reason, I had to reconsider the number of handlers > for each ns. In addition, when I reconfigure the total number of handlers on > the router, I have to re-allocate handlers to each ns, which undoubtedly > increases the complexity of operation and maintenance. > 2. *Extension ns is not supported*: During the running of the router, if a > new ns is added to the cluster and a mount is added for the ns, but because > no handler is allocated for the ns, the ns cannot be accessed through the > router. We must reconfigure the number of handlers and then refresh the > configuration. At this time, the router can access the ns normally. When we > reconfigure the number of handlers, we have to face disadvantage 1: > Configuration is inconvenient and error-prone. > 3. *Waste handlers*: The main purpose of proposing > RouterRpcFairnessPolicyController is to enable the router to access ns with > normal load and not be affected by ns with higher load. First of all, not all > ns have high loads; secondly, ns with high loads do not have high loads 24 > hours a day. 
It may be that only certain time periods, such as 0 to 8 > o'clock, have high loads, and other time periods have normal loads. Assume > there are 2 ns, and each ns is allocated half of the number of handlers. > Assume that ns1 has many requests from 0 to 14 o'clock, and almost no > requests from 14 to 24 o'clock, ns2 has many requests from 12 to 24 o'clock, > and almost no requests from 0 to 14 o'clock; when it is between 0 o'clock and > 12 o'clock and between 14 o'clock and 24 o'clock, only one ns has more > requests and the other ns has almost no requests, so we have wasted half of > the number of handlers. > 4. *Only isolation, no sharing*: The staticRouterRpcFairnessPolicyController > does not support sharing, only isolation. I think isolation is just a means > to improve the performance of router access to normal ns, not the purpose. It > is impossible for all ns in the cluster to have high loads. On the contrary, > in most scenarios, only a few ns in the cluster have high loads, and the > loads of most other ns are normal. For ns with higher load and ns with normal > load, we need to isolate their handlers so that the ns with higher load will > not affect the performance of ns with lower load. However, for nameservices > that are
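The percentage-based allocation being discussed can be illustrated with a small sketch. This is a hypothetical helper with invented names, not the actual ProportionRouterRpcFairnessPolicyController code; it only shows why proportions avoid the "sum exceeds total handlers" startup failure, and why they can express sharing (proportions summing to more than 1) as well as isolation:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of proportion-based handler allocation.
// Each ns gets round(totalHandlers * proportion) permits; since each
// proportion is independent, changing the router's total handler count or
// adding a new ns can never produce an over-allocation that blocks startup.
public class ProportionAllocationSketch {
    static Map<String, Integer> allocate(int totalHandlers,
                                         Map<String, Double> proportions) {
        Map<String, Integer> permits = new LinkedHashMap<>();
        for (Map.Entry<String, Double> e : proportions.entrySet()) {
            int n = (int) Math.round(totalHandlers * e.getValue());
            // Minimum of 1 so a newly mounted ns is never locked out.
            permits.put(e.getKey(), Math.max(1, n));
        }
        return permits;
    }

    public static void main(String[] args) {
        Map<String, Double> proportions = new LinkedHashMap<>();
        proportions.put("ns1", 0.6); // 60% of handlers
        proportions.put("ns2", 0.6); // proportions may overlap: sharing
        System.out.println(allocate(100, proportions)); // {ns1=60, ns2=60}
    }
}
```

With a static controller, 0.6 + 0.6 of 100 handlers would be a fatal misconfiguration; with proportions it simply means the two nameservices share capacity.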
[jira] [Commented] (HDFS-17293) First packet data + checksum size will be set to 516 bytes when writing to a new block.
[ https://issues.apache.org/jira/browse/HDFS-17293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17808450#comment-17808450 ] ASF GitHub Bot commented on HDFS-17293: --- zhangshuyan0 commented on PR #6368: URL: https://github.com/apache/hadoop/pull/6368#issuecomment-1899635293 This PR has corrected the size of the first packet in a new block, which is great. However, due to the original logical problem in `adjustChunkBoundary`, the calculation of the size of the last packet in a block is still problematic, and I think we need a new PR to solve it. https://github.com/apache/hadoop/blob/27ecc23ae7c5cafba6a5ea58d4a68d25bd7507dd/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DFSOutputStream.java#L531-L543 At line 540, when we pass `blockSize - getStreamer().getBytesCurBlock()` to `computePacketChunkSize` as the first parameter, `computePacketChunkSize` is likely to split data that could have been sent in one packet into two packets. > First packet data + checksum size will be set to 516 bytes when writing to a > new block. > --- > > Key: HDFS-17293 > URL: https://issues.apache.org/jira/browse/HDFS-17293 > Project: Hadoop HDFS > Issue Type: Improvement >Affects Versions: 3.3.6 >Reporter: farmmamba >Assignee: farmmamba >Priority: Major > Labels: pull-request-available > > First packet size will be set to 516 bytes when writing to a new block. > In method computePacketChunkSize, the parameters psize and csize would be > (0, 512) > when writing to a new block. It would be better to use writePacketSize. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
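The packet sizing arithmetic behind this discussion can be sketched as follows. This is a simplified stand-in for `DFSOutputStream#computePacketChunkSize`, not the actual code; the 33-byte header length is an assumed value for `PacketHeader.PKT_MAX_HEADER_LEN`:

```java
// Simplified sketch of the packet sizing arithmetic discussed above.
// The reported bug: passing psize = 0 for a fresh block means only one
// chunk fits, so the first packet carries a single 516-byte chunk.
public class PacketChunkSizeSketch {
    static final int BYTES_PER_CHECKSUM = 512;  // default chunk payload
    static final int CHECKSUM_SIZE = 4;         // CRC32C checksum bytes
    static final int PKT_MAX_HEADER_LEN = 33;   // assumed header length

    // Packet body size for a requested packet size psize (data + checksums).
    static int packetContentSize(int psize) {
        int chunkSize = BYTES_PER_CHECKSUM + CHECKSUM_SIZE;   // 516
        int chunksPerPacket = Math.max(psize / chunkSize, 1); // at least 1 chunk
        return chunksPerPacket * chunkSize;
    }

    public static void main(String[] args) {
        // psize = 0 (new block): a single 516-byte chunk per packet.
        System.out.println(packetContentSize(0));                       // 516
        // psize derived from the default 64 KiB writePacketSize: full packets.
        System.out.println(packetContentSize(65536 - PKT_MAX_HEADER_LEN)); // 65016
    }
}
```

This also matches the reviewer's suggested assertion: the expected packet content size is `(writePacketSize - PKT_MAX_HEADER_LEN) / chunkSize * chunkSize`.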
[jira] [Commented] (HDFS-17313) dfsadmin -reconfig option to start/query reconfig on all live namenodes.
[ https://issues.apache.org/jira/browse/HDFS-17313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17808449#comment-17808449 ] ASF GitHub Bot commented on HDFS-17313: --- huangzhaobo99 commented on PR #6395: URL: https://github.com/apache/hadoop/pull/6395#issuecomment-1899632674 @goiri If you have time, could you also help review this? An earlier ticket added batch refreshing for DataNodes, but the relevant reviewers have not replied to me. This update adds a batch refresh mechanism for NameNodes. Thx. > dfsadmin -reconfig option to start/query reconfig on all live namenodes. > > > Key: HDFS-17313 > URL: https://issues.apache.org/jira/browse/HDFS-17313 > Project: Hadoop HDFS > Issue Type: New Feature >Reporter: huangzhaobo >Assignee: huangzhaobo >Priority: Major > Labels: pull-request-available > > https://issues.apache.org/jira/browse/HDFS-16568 supports batch refreshing of > datanode configurations. > There are several NameNodes in an HA or federated cluster, and this ticket > implements batch refreshing of NameNode configurations. > *Implementation method* > # Use the DFSUtil.getNNServiceRpcAddressesForCluster method to parse the > configuration and obtain the addresses of all NameNodes > # Use two worker threads; configuring the number of worker threads is not > yet supported (it will be implemented in another ticket if necessary) > *Sample outputs* > {code:java} > $ bin/hdfs dfsadmin -reconfig namenode livenodes start > Started reconfiguration task on node [localhost:50034]. > Started reconfiguration task on node [localhost:50036]. > Started reconfiguration task on node [localhost:50038]. > Started reconfiguration task on node [localhost:50040]. > Starting of reconfiguration task successful on 4 nodes, failed on 0 nodes. 
> $ bin/hdfs dfsadmin -reconfig namenode livenodes status > Reconfiguring status for node [localhost:50034] > SUCCESS: Changed property dfs.heartbeat.interval > From: "5" > To: "3" > Reconfiguring status for node [localhost:50036] > SUCCESS: Changed property dfs.heartbeat.interval > From: "5" > To: "3" > Reconfiguring status for node [localhost:50038] > SUCCESS: Changed property dfs.heartbeat.interval > From: "5" > To: "3" > Reconfiguring status for node [localhost:50040] > SUCCESS: Changed property dfs.heartbeat.interval > From: "5" > To: "3" > Retrieval of reconfiguration status successful on 4 nodes, failed on 0 > nodes.{code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
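The two-worker fan-out described in the implementation notes might look roughly like this. It is a hypothetical helper with invented names, not the actual DFSAdmin code; the RPC call is stubbed out behind an interface:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch: start a reconfiguration task on every NN address
// using a fixed pool of two worker threads, then tally successes/failures.
public class BatchReconfigSketch {
    interface ReconfigClient { // stand-in for the NN admin RPC
        void startReconfiguration(String nnAddress) throws Exception;
    }

    static int[] startOnAll(List<String> nnAddresses, ReconfigClient client)
            throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(2); // two workers
        List<Future<Boolean>> futures = new ArrayList<>();
        for (String addr : nnAddresses) {
            futures.add(pool.submit(() -> {
                client.startReconfiguration(addr);
                System.out.println("Started reconfiguration task on node ["
                    + addr + "].");
                return true;
            }));
        }
        int ok = 0, failed = 0;
        for (Future<Boolean> f : futures) {
            try { f.get(); ok++; } catch (ExecutionException e) { failed++; }
        }
        pool.shutdown();
        return new int[] {ok, failed};
    }

    public static void main(String[] args) throws InterruptedException {
        List<String> nns = List.of("localhost:50034", "localhost:50036",
                                   "localhost:50038", "localhost:50040");
        int[] r = startOnAll(nns, addr -> { /* RPC would go here */ });
        System.out.println("Starting of reconfiguration task successful on "
            + r[0] + " nodes, failed on " + r[1] + " nodes.");
    }
}
```

A fixed pool bounds concurrent RPCs to the NameNodes while still letting slow nodes overlap; per-node failures are counted rather than aborting the whole batch, matching the "successful on N nodes, failed on M nodes" summary in the sample output.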
[jira] [Commented] (HDFS-17302) RBF: ProportionRouterRpcFairnessPolicyController-Sharing and isolation.
[ https://issues.apache.org/jira/browse/HDFS-17302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17808446#comment-17808446 ] ASF GitHub Bot commented on HDFS-17302: --- huangzhaobo99 commented on PR #6380: URL: https://github.com/apache/hadoop/pull/6380#issuecomment-1899621973 > @huangzhaobo99 do you still have concerns with the approach? @goiri No worries anymore, I think the sharing mechanism is really good, and percentage based allocation is easier to use. cc @KeeProMise > RBF: ProportionRouterRpcFairnessPolicyController-Sharing and isolation. > --- > > Key: HDFS-17302 > URL: https://issues.apache.org/jira/browse/HDFS-17302 > Project: Hadoop HDFS > Issue Type: New Feature > Components: rbf >Reporter: Jian Zhang >Assignee: Jian Zhang >Priority: Major > Labels: pull-request-available > Attachments: HDFS-17302.001.patch, HDFS-17302.002.patch, > HDFS-17302.003.patch > > > h2. Current shortcomings > [HDFS-14090|https://issues.apache.org/jira/browse/HDFS-14090] provides a > StaticRouterRpcFairnessPolicyController to support configuring different > handlers for different ns. Using the StaticRouterRpcFairnessPolicyController > allows the router to isolate different ns, and the ns with a higher load will > not affect the router's access to the ns with a normal load. But the > StaticRouterRpcFairnessPolicyController still falls short in many ways, such > as: > 1. *Configuration is inconvenient and error-prone*: When I use > StaticRouterRpcFairnessPolicyController, I first need to know how many > handlers the router has in total, then I have to know how many nameservices > the router currently has, and then carefully calculate how many handlers to > allocate to each ns so that the sum of handlers for all ns will not exceed > the total handlers of the router, and I also need to consider how many > handlers to allocate to each ns to achieve better performance. Therefore, I > need to be very careful when configuring. 
Even if I configure only one more > handler for a certain ns, the total number is more than the number of > handlers owned by the router, which will also cause the router to fail to > start. At this time, I had to investigate the reason why the router failed to > start. After finding the reason, I had to reconsider the number of handlers > for each ns. In addition, when I reconfigure the total number of handlers on > the router, I have to re-allocate handlers to each ns, which undoubtedly > increases the complexity of operation and maintenance. > 2. *Extension ns is not supported*: During the running of the router, if a > new ns is added to the cluster and a mount is added for the ns, but because > no handler is allocated for the ns, the ns cannot be accessed through the > router. We must reconfigure the number of handlers and then refresh the > configuration. At this time, the router can access the ns normally. When we > reconfigure the number of handlers, we have to face disadvantage 1: > Configuration is inconvenient and error-prone. > 3. *Waste handlers*: The main purpose of proposing > RouterRpcFairnessPolicyController is to enable the router to access ns with > normal load and not be affected by ns with higher load. First of all, not all > ns have high loads; secondly, ns with high loads do not have high loads 24 > hours a day. It may be that only certain time periods, such as 0 to 8 > o'clock, have high loads, and other time periods have normal loads. Assume > there are 2 ns, and each ns is allocated half of the number of handlers. > Assume that ns1 has many requests from 0 to 14 o'clock, and almost no > requests from 14 to 24 o'clock, ns2 has many requests from 12 to 24 o'clock, > and almost no requests from 0 to 14 o'clock; when it is between 0 o'clock and > 12 o'clock and between 14 o'clock and 24 o'clock, only one ns has more > requests and the other ns has almost no requests, so we have wasted half of > the number of handlers. > 4. 
*Only isolation, no sharing*: The staticRouterRpcFairnessPolicyController > does not support sharing, only isolation. I think isolation is just a means > to improve the performance of router access to normal ns, not the purpose. It > is impossible for all ns in the cluster to have high loads. On the contrary, > in most scenarios, only a few ns in the cluster have high loads, and the > loads of most other ns are normal. For ns with higher load and ns with normal > load, we need to isolate their handlers so that the ns with higher load will > not affect the performance of ns with lower load. However, for nameservices > that are also under normal load, or are under higher load, we do not need to > isolate them, these ns of the same nature can share
[jira] [Commented] (HDFS-17293) First packet data + checksum size will be set to 516 bytes when writing to a new block.
[ https://issues.apache.org/jira/browse/HDFS-17293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17808444#comment-17808444 ] ASF GitHub Bot commented on HDFS-17293: --- zhangshuyan0 commented on code in PR #6368: URL: https://github.com/apache/hadoop/pull/6368#discussion_r1458246249 ## hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestDFSOutputStream.java: ## @@ -184,6 +186,40 @@ public void testPreventOverflow() throws IOException, NoSuchFieldException, runAdjustChunkBoundary(configuredWritePacketSize, finalWritePacketSize); } + @Test(timeout=6) + public void testFirstPacketSizeInNewBlocks() throws IOException { +final long blockSize = 1L * 1024 * 1024; +final int numDataNodes = 3; +final Configuration dfsConf = new Configuration(); +dfsConf.setLong(DFS_BLOCK_SIZE_KEY, blockSize); +MiniDFSCluster dfsCluster = null; +dfsCluster = new MiniDFSCluster.Builder(dfsConf).numDataNodes(numDataNodes).build(); +dfsCluster.waitActive(); + +DistributedFileSystem fs = dfsCluster.getFileSystem(); +Path fileName = new Path("/testfile.dat"); +FSDataOutputStream fos = fs.create(fileName); +DataChecksum crc32c = DataChecksum.newDataChecksum(DataChecksum.Type.CRC32C, 512); + +long loop = 0; +Random r = new Random(); +byte[] buf = new byte[1 * 1024 * 1024]; +r.nextBytes(buf); +fos.write(buf); +fos.hflush(); + +while (loop < 20) { + r.nextBytes(buf); + fos.write(buf); + fos.hflush(); + loop++; + Assert.assertNotEquals(crc32c.getBytesPerChecksum() + crc32c.getChecksumSize(), Review Comment: It is more appropriate to precisely specify the expected `packetSize` here. 
Outside the `while loop`: ``` int chunkSize = crc32c.getBytesPerChecksum() + crc32c.getChecksumSize(); int packetContentSize = (dfsConf.getInt(DFS_CLIENT_WRITE_PACKET_SIZE_KEY, DFS_CLIENT_WRITE_PACKET_SIZE_DEFAULT) - PacketHeader.PKT_MAX_HEADER_LEN)/chunkSize*chunkSize; ``` And here: ``` Assert.assertEquals(((DFSOutputStream) fos.getWrappedStream()).packetSize, packetContentSize); ``` ## hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestDFSOutputStream.java: ## @@ -184,6 +186,40 @@ public void testPreventOverflow() throws IOException, NoSuchFieldException, runAdjustChunkBoundary(configuredWritePacketSize, finalWritePacketSize); } + @Test(timeout=6) + public void testFirstPacketSizeInNewBlocks() throws IOException { +final long blockSize = 1L * 1024 * 1024; +final int numDataNodes = 3; +final Configuration dfsConf = new Configuration(); +dfsConf.setLong(DFS_BLOCK_SIZE_KEY, blockSize); +MiniDFSCluster dfsCluster = null; +dfsCluster = new MiniDFSCluster.Builder(dfsConf).numDataNodes(numDataNodes).build(); +dfsCluster.waitActive(); + +DistributedFileSystem fs = dfsCluster.getFileSystem(); +Path fileName = new Path("/testfile.dat"); +FSDataOutputStream fos = fs.create(fileName); +DataChecksum crc32c = DataChecksum.newDataChecksum(DataChecksum.Type.CRC32C, 512); + +long loop = 0; +Random r = new Random(); +byte[] buf = new byte[1 * 1024 * 1024]; Review Comment: `byte[] buf = new byte[(int) blockSize];` > First packet data + checksum size will be set to 516 bytes when writing to a > new block. > --- > > Key: HDFS-17293 > URL: https://issues.apache.org/jira/browse/HDFS-17293 > Project: Hadoop HDFS > Issue Type: Improvement >Affects Versions: 3.3.6 >Reporter: farmmamba >Assignee: farmmamba >Priority: Major > Labels: pull-request-available > > First packet size will be set to 516 bytes when writing to a new block. > In method computePacketChunkSize, the parameters psize and csize would be > (0, 512) > when writting to a new block. 
It would be better to use writePacketSize. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-17332) DFSInputStream: avoid logging stacktrace until when we really need to fail a read request with a MissingBlockException
[ https://issues.apache.org/jira/browse/HDFS-17332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17808427#comment-17808427 ] ASF GitHub Bot commented on HDFS-17332: --- ctrezzo merged PR #6446: URL: https://github.com/apache/hadoop/pull/6446 > DFSInputStream: avoid logging stacktrace until when we really need to fail a > read request with a MissingBlockException > -- > > Key: HDFS-17332 > URL: https://issues.apache.org/jira/browse/HDFS-17332 > Project: Hadoop HDFS > Issue Type: Improvement > Components: hdfs >Reporter: Xing Lin >Assignee: Xing Lin >Priority: Minor > Labels: pull-request-available > > In DFSInputStream#actualGetFromOneDataNode(), it would send the exception > stacktrace to the dfsClient.LOG whenever we fail on a DN. However, in most > cases, the read request will be served successfully by reading from the next > available DN. The existence of the exception stacktrace in the log has caused > multiple hadoop users at Linkedin to consider this WARN message as the > RC/fatal error for their jobs. We would like to improve the log message and > avoid sending the stacktrace to dfsClient.LOG when a read succeeds. The > stack trace from reading each DN is sent to the log only when we really need > to fail a read request (when chooseDataNode()/refetchLocations() throws a > BlockMissingException). > > Example stack trace > {code:java} > [12]:23/11/30 23:01:33 WARN hdfs.DFSClient: Connection failure: > Failed to connect to 10.150.91.13/10.150.91.13:71 for file > //part--95b9909c-zzz-c000.avro for block > BP-364971551-DatanodeIP-1448516588954:blk__129864739321:java.net.SocketTimeoutException: > 6 millis timeout while waiting for channel to be ready for read. ch : > java.nio.channels.SocketChannel[connected local=/ip:40492 > remote=datanodeIP:71] [12]:java.net.SocketTimeoutException: 6 > millis timeout while waiting for channel to be ready for read. 
ch : > java.nio.channels.SocketChannel[connected local=/localIp:40492 > remote=datanodeIP:71] [12]: at > org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164) > [12]: at > org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161) > [12]: at > org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131) > [12]: at > org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:118) > [12]: at java.io.FilterInputStream.read(FilterInputStream.java:83) > [12]: at > org.apache.hadoop.hdfs.protocolPB.PBHelperClient.vintPrefixed(PBHelperClient.java:458) > [12]: at > org.apache.hadoop.hdfs.client.impl.BlockReaderRemote2.newBlockReader(BlockReaderRemote2.java:412) > [12]: at > org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.getRemoteBlockReader(BlockReaderFactory.java:864) > [12]: at > org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:753) > [12]: at > org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.build(BlockReaderFactory.java:387) > [12]: at > org.apache.hadoop.hdfs.DFSInputStream.getBlockReader(DFSInputStream.java:736) > [12]: at > org.apache.hadoop.hdfs.DFSInputStream.actualGetFromOneDataNode(DFSInputStream.java:1268) > [12]: at > org.apache.hadoop.hdfs.DFSInputStream.fetchBlockByteRange(DFSInputStream.java:1216) > [12]: at > org.apache.hadoop.hdfs.DFSInputStream.pread(DFSInputStream.java:1608) > [12]: at > org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:1568) > [12]: at > org.apache.hadoop.fs.FSDataInputStream.read(FSDataInputStream.java:93) > [12]: at > hdfs_metrics_shade.org.apache.hadoop.fs.InstrumentedFSDataInputStream$InstrumentedFilterInputStream.lambda$read$0(InstrumentedFSDataInputStream.java:108) > [12]: at > com.linkedin.hadoop.metrics.fs.PerformanceTrackingFSDataInputStream.process(PerformanceTrackingFSDataInputStream.java:39) > [12]: at > 
hdfs_metrics_shade.org.apache.hadoop.fs.InstrumentedFSDataInputStream$InstrumentedFilterInputStream.read(InstrumentedFSDataInputStream.java:108) > [12]: at > org.apache.hadoop.fs.FSDataInputStream.read(FSDataInputStream.java:93) > [12]: at > org.apache.hadoop.fs.RetryingInputStream.lambda$read$2(RetryingInputStream.java:153) > [12]: at > org.apache.hadoop.fs.NoOpRetryPolicy.run(NoOpRetryPolicy.java:36) > [12]: at > org.apache.hadoop.fs.RetryingInputStream.read(RetryingInputStream.java:149) > [12]: at > org.apache.hadoop.fs.FSDataInputStream.read(FSDataInputStream.java:93){code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands,
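The change described above boils down to a common logging pattern: per-DN failures log only the exception message, the last exception is retained, and the full stack trace is emitted only when every DN has failed and the read must fail. A minimal sketch with hypothetical names, not the actual DFSInputStream code:

```java
import java.io.IOException;
import java.util.List;

// Hypothetical sketch of deferred-stacktrace logging: each per-DN failure
// gets a one-line WARN (the read usually succeeds on the next DN), and the
// saved exception, with its stack trace, is propagated only when all DNs
// are exhausted, analogous to a BlockMissingException.
public class DeferredStacktraceSketch {
    interface DnReader { byte[] read(String dn) throws IOException; }

    static byte[] readWithFallback(List<String> datanodes, DnReader reader)
            throws IOException {
        IOException lastException = null;
        for (String dn : datanodes) {
            try {
                return reader.read(dn);
            } catch (IOException e) {
                lastException = e;
                // Message only: no stack trace while fallbacks remain.
                System.out.println("WARN Connection failure reading from "
                    + dn + ": " + e.getMessage());
            }
        }
        // Only now does the caller see the stack trace, as the cause.
        throw new IOException("Could not obtain block from any DN", lastException);
    }

    public static void main(String[] args) throws IOException {
        byte[] data = readWithFallback(List.of("dn1:9866", "dn2:9866"), dn -> {
            if (dn.startsWith("dn1")) throw new IOException("read timeout");
            return new byte[] {42};
        });
        System.out.println(data[0]); // 42: dn1 failed quietly, dn2 served it
    }
}
```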
[jira] [Updated] (HDFS-17341) Support dedicated user queues in Namenode FairCallQueue
[ https://issues.apache.org/jira/browse/HDFS-17341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lei Yang updated HDFS-17341: Description: Some service users today in namenode like ETL, metrics collection, ad-hoc users that are critical to run business critical job accounts for many traffic in namenode and shouldn't be throttled the same way as other individual users in FCQ. There is [feature|https://issues.apache.org/jira/browse/HADOOP-17165] in namenode to always prioritize some service users to not subject to FCQ scheduling. (Those users are always p0) but it is not perfect and it doesn't account for traffic surge from those users. The idea is to allocate dedicated rpc queues for those service users with bounded queue capacity and allocate processing weight for those users. If queue is full, those users are expected to backoff and retry. New configs: {code:java} "faircallqueue.reserved.users"; // list of service users that are assigned to dedicated queue "faircallqueue.reserved.users.max"; // max number of service users allowed "faircallqueue.reserved.users.capacities"; // custom queue capacities for each service user "faircallqueue.multiplexer.reserved.weights"; // processing weights for each dedicated queue{code} For instance, for a FCQ with 4 priority levels, 2 reserved users(a, b) FCQ would look like: {code:java} P0: shared queue P1: shared queue P2: shared queue P3: shared queue P4: dedicated for user a P5: dedicated for user b{code} {color:#172b4d}The Multiplexer would have following weights{color} {color:#172b4d}shared queue default weights: [8, 4, 2, 1]{color} {color:#172b4d}reserved queue weights=[3, 2]{color} {color:#172b4d}So user a gets 15% of total cycles, user b gets 10% of total cycles.{color} was: Some service users today in namenode like ETL, metrics collection, ad-hoc users that are critical to run business critical job accounts for many traffic in namenode and shouldn't be throttled the same way as other individual users 
in FCQ. There is feature in namenode to always prioritize some service users to not subject to FCQ scheduling. (Those users are always p0) but it is not perfect and it doesn't account for traffic surge from those users. The idea is to allocate dedicated rpc queues for those service users with bounded queue capacity and allocate processing weight for those users. If queue is full, those users are expected to backoff and retry. New configs: {code:java} "faircallqueue.reserved.users"; // list of service users that are assigned to dedicated queue "faircallqueue.reserved.users.max"; // max number of service users allowed "faircallqueue.reserved.users.capacities"; // custom queue capacities for each service user "faircallqueue.multiplexer.reserved.weights"; // processing weights for each dedicated queue{code} For instance, for a FCQ with 4 priority levels, 2 reserved users(a, b) FCQ would look like: {code:java} P0: shared queue P1: shared queue P2: shared queue P3: shared queue P4: dedicated for user a P5: dedicated for user b{code} {color:#172b4d}The Multiplexer would have following weights{color} {color:#172b4d}shared queue default weights: [8, 4, 2, 1]{color} {color:#172b4d}reserved queue weights=[3, 2]{color} {color:#172b4d}So user a gets 15% of total cycles, user b gets 10% of total cycles.{color} > Support dedicated user queues in Namenode FairCallQueue > --- > > Key: HDFS-17341 > URL: https://issues.apache.org/jira/browse/HDFS-17341 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namenode >Affects Versions: 2.10.0, 3.4.0 >Reporter: Lei Yang >Priority: Major > Labels: pull-request-available > > Some service users today in namenode like ETL, metrics collection, ad-hoc > users that are critical to run business critical job accounts for many > traffic in namenode and shouldn't be throttled the same way as other > individual users in FCQ. 
> There is a [feature|https://issues.apache.org/jira/browse/HADOOP-17165] in > the namenode to always prioritize some service users so that they are not > subject to FCQ scheduling (those users are always p0), but it is not perfect > and it doesn't account for traffic surges from those users. > The idea is to allocate dedicated RPC queues for those service users with > bounded queue capacity and to allocate a processing weight for each of them. > If a queue is full, those users are expected to back off and retry. > > New configs: > {code:java} > "faircallqueue.reserved.users"; // list of service users that are assigned to > dedicated queue > "faircallqueue.reserved.users.max"; // max number of service users allowed > "faircallqueue.reserved.users.capacities"; // custom queue capacities for > each service user > "faircallqueue.multiplexer.reserved.weights";
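The 15% and 10% figures quoted above follow directly from dividing each reserved queue's weight by the sum of all multiplexer weights (8+4+2+1+3+2 = 20). A small sketch of that arithmetic:

```java
// Sketch of the weight arithmetic above: each queue's share of processing
// cycles is its multiplexer weight divided by the sum of all weights.
public class WeightShareSketch {
    static double share(int weight, int[] allWeights) {
        int total = 0;
        for (int w : allWeights) total += w;
        return (double) weight / total;
    }

    public static void main(String[] args) {
        // Shared queue weights [8, 4, 2, 1] plus reserved weights [3, 2].
        int[] weights = {8, 4, 2, 1, 3, 2};       // total = 20
        System.out.println(share(3, weights));    // user a: 0.15 (15% of cycles)
        System.out.println(share(2, weights));    // user b: 0.10 (10% of cycles)
    }
}
```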
[jira] [Commented] (HDFS-17343) Revert HDFS-16016. BPServiceActor to provide new thread to handle IBR
[ https://issues.apache.org/jira/browse/HDFS-17343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17808400#comment-17808400 ] ASF GitHub Bot commented on HDFS-17343: --- slfan1989 commented on PR #6457: URL: https://github.com/apache/hadoop/pull/6457#issuecomment-1899387396 @ayushtkn Can you help review this PR? Thank you very much! > Revert HDFS-16016. BPServiceActor to provide new thread to handle IBR > - > > Key: HDFS-17343 > URL: https://issues.apache.org/jira/browse/HDFS-17343 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namenode >Affects Versions: 3.4.0 >Reporter: Shilun Fan >Assignee: Shilun Fan >Priority: Major > Labels: pull-request-available > > When preparing for the hadoop-3.4.0 release, we found that HDFS-16016 may > cause mis-ordering of incremental block reports (IBRs) and full block reports > (FBRs) on the DataNode. After discussion, we decided to revert HDFS-16016. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-17341) Support dedicated user queues in Namenode FairCallQueue
[ https://issues.apache.org/jira/browse/HDFS-17341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17808392#comment-17808392 ] Lei Yang commented on HDFS-17341: - [~hexiaoqiao] Thanks for your comment. {quote}One concern: we evaluate whether a request is high or low priority based on the user only, but in fact not all requests from a given user are always high or low priority. {quote} Not sure I understand this. The idea is to exempt some critical service users from the existing FCQ mechanism so they are not throttled the same way as regular users in the shared queue. Meanwhile, those users should not flood the entire queue if there is a traffic surge (https://issues.apache.org/jira/browse/HADOOP-17165 can assign a service user to p0, but it cannot handle a traffic surge from those users). We can assign weights to those users to ensure they do not exceed a certain % of total processing cycles. > Support dedicated user queues in Namenode FairCallQueue > --- > > Key: HDFS-17341 > URL: https://issues.apache.org/jira/browse/HDFS-17341 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namenode >Affects Versions: 2.10.0, 3.4.0 >Reporter: Lei Yang >Priority: Major > Labels: pull-request-available > > Some service users in the namenode today, such as ETL, metrics collection, > and ad-hoc users that run business-critical jobs, account for much of the > namenode traffic and shouldn't be throttled the same way as other individual > users in FCQ. > There is a feature in the namenode to always prioritize some service users > so they are not subject to FCQ scheduling. (Those users are always p0.) But > it is not perfect and it doesn't account for traffic surges from those > users. > The idea is to allocate dedicated rpc queues for those service users with > bounded queue capacity and allocate processing weight for those users. If > the queue is full, those users are expected to back off and retry. 
> > New configs: > {code:java} > "faircallqueue.reserved.users"; // list of service users that are assigned to > dedicated queue > "faircallqueue.reserved.users.max"; // max number of service users allowed > "faircallqueue.reserved.users.capacities"; // custom queue capacities for > each service user > "faircallqueue.multiplexer.reserved.weights"; // processing weights for each > dedicated queue{code} > For instance, for a FCQ with 4 priority levels, 2 reserved users(a, b) > FCQ would look like: > > {code:java} > P0: shared queue > P1: shared queue > P2: shared queue > P3: shared queue > P4: dedicated for user a > P5: dedicated for user b{code} > {color:#172b4d}The Multiplexer would have following weights{color} > {color:#172b4d}shared queue default weights: [8, 4, 2, 1]{color} > {color:#172b4d}reserved queue weights=[3, 2]{color} > {color:#172b4d}So user a gets 15% of total cycles, user b gets 10% of total > cycles.{color} > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
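The share arithmetic in the example above (user a gets 15%, user b gets 10%) can be checked with a short sketch. The weights are taken from the description; the class and method names below are illustrative only, not part of the proposed patch:

```java
// Sketch: with shared multiplexer weights [8, 4, 2, 1] and reserved
// weights [3, 2], total weight is 20, so reserved user a gets 3/20 = 15%
// of processing cycles and user b gets 2/20 = 10%.
public class FcqShareMath {
    /** Fraction of total processing cycles each reserved queue receives. */
    static double[] reservedShares(int[] sharedWeights, int[] reservedWeights) {
        int total = 0;
        for (int w : sharedWeights) total += w;
        for (int w : reservedWeights) total += w;
        double[] shares = new double[reservedWeights.length];
        for (int i = 0; i < reservedWeights.length; i++) {
            shares[i] = (double) reservedWeights[i] / total;
        }
        return shares;
    }

    public static void main(String[] args) {
        // Shared queue default weights [8, 4, 2, 1]; reserved users a=3, b=2.
        double[] r = reservedShares(new int[] {8, 4, 2, 1}, new int[] {3, 2});
        System.out.println(r[0] + " " + r[1]); // prints 0.15 0.1
    }
}
```

Note that adding reserved queues dilutes the shared queues' share as well: the shared levels together drop from 100% to 15/20 = 75% of processing cycles in this example.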
[jira] [Updated] (HDFS-17341) Support dedicated user queues in Namenode FairCallQueue
[ https://issues.apache.org/jira/browse/HDFS-17341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lei Yang updated HDFS-17341: Description: Some service users today in namenode like ETL, metrics collection, ad-hoc users that are critical to run business critical job accounts for many traffic in namenode and shouldn't be throttled the same way as other individual users in FCQ. There is feature in namenode to always prioritize some service users to not subject to FCQ scheduling. (Those users are always p0) but it is not perfect and it doesn't account for traffic surge from those users. The idea is to allocate dedicated rpc queues for those service users with bounded queue capacity and allocate processing weight for those users. If queue is full, those users are expected to backoff and retry. New configs: {code:java} "faircallqueue.reserved.users"; // list of service users that are assigned to dedicated queue "faircallqueue.reserved.users.max"; // max number of service users allowed "faircallqueue.reserved.users.capacities"; // custom queue capacities for each service user "faircallqueue.multiplexer.reserved.weights"; // processing weights for each dedicated queue{code} For instance, for a FCQ with 4 priority levels, 2 reserved users(a, b) FCQ would look like: {code:java} P0: shared queue P1: shared queue P2: shared queue P3: shared queue P4: dedicated for user a P5: dedicated for user b{code} {color:#172b4d}The Multiplexer would have following weights{color} {color:#172b4d}shared queue default weights: [8, 4, 2, 1]{color} {color:#172b4d}reserved queue weights=[3, 2]{color} {color:#172b4d}So user a gets 15% of total cycles, user b gets 10% of total cycles.{color} was: Some service users today in namenode like ETL, metrics collection, ad-hoc users that are critical to run business critical job accounts for many traffic in namenode and shouldn't be throttled the same way as other individual users in FCQ. 
There is feature in namenode to always prioritize some service users to not subject to FCQ scheduling. (Those users are always p0) but it is not perfect and it doesn't account for traffic surge from those users. The idea is to allocate dedicated rpc queues for those service users with bounded queue capacity and allocate processing weight for those users. If queue is full, those users are expected to backoff and retry. New configs: {code:java} "faircallqueue.reserved.users"; // list of service users that are assigned to dedicated queue "faircallqueue.reserved.users.max"; // max number of service users allowed "faircallqueue.reserved.users.capacities"; // custom queue capacities for each service user "faircallqueue.multiplexer.reserved.weights"; // processing weights for each dedicated queue{code} For instance, for a FCQ with 4 priority levels, 2 reserved users(a, b) FCQ would look like: {code:java} P0: shared queue P1: shared queue P2: shared queue P3: shared queue P4: dedicated for user a P5: dedicated for user b{code} {color:#172b4d}The WRM would have following weights{color} {color:#172b4d}shared queue default weights: [8, 4, 2, 1]{color} {color:#172b4d}reserved queue weights=[3, 2]{color} {color:#172b4d}So user a gets 15% of total cycles, user b gets 10% of total cycles.{color} > Support dedicated user queues in Namenode FairCallQueue > --- > > Key: HDFS-17341 > URL: https://issues.apache.org/jira/browse/HDFS-17341 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namenode >Affects Versions: 2.10.0, 3.4.0 >Reporter: Lei Yang >Priority: Major > Labels: pull-request-available > > Some service users today in namenode like ETL, metrics collection, ad-hoc > users that are critical to run business critical job accounts for many > traffic in namenode and shouldn't be throttled the same way as other > individual users in FCQ. > There is feature in namenode to always prioritize some service users to not > subject to FCQ scheduling. 
(Those users are always p0) but it is not perfect > and it doesn't account for traffic surge from those users. > The idea is to allocate dedicated rpc queues for those service users with > bounded queue capacity and allocate processing weight for those users. If > queue is full, those users are expected to backoff and retry. > > New configs: > {code:java} > "faircallqueue.reserved.users"; // list of service users that are assigned to > dedicated queue > "faircallqueue.reserved.users.max"; // max number of service users allowed > "faircallqueue.reserved.users.capacities"; // custom queue capacities for > each service user > "faircallqueue.multiplexer.reserved.weights"; // processing weights for each > dedicated queue{code} > For instance, for a FCQ with 4 priority levels, 2 reserved
[jira] [Commented] (HDFS-17342) Fix DataNode may invalidates normal block causing missing block
[ https://issues.apache.org/jira/browse/HDFS-17342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17808371#comment-17808371 ] ASF GitHub Bot commented on HDFS-17342: --- hadoop-yetus commented on PR #6464: URL: https://github.com/apache/hadoop/pull/6464#issuecomment-1899203477 :broken_heart: **-1 overall** | Vote | Subsystem | Runtime | Logfile | Comment | |::|--:|:|::|:---:| | +0 :ok: | reexec | 0m 23s | | Docker mode activated. | _ Prechecks _ | | +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. | | +0 :ok: | codespell | 0m 0s | | codespell was not available. | | +0 :ok: | detsecrets | 0m 0s | | detect-secrets was not available. | | +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. | | +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 1 new or modified test files. | _ trunk Compile Tests _ | | +1 :green_heart: | mvninstall | 32m 2s | | trunk passed | | +1 :green_heart: | compile | 0m 41s | | trunk passed with JDK Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04 | | +1 :green_heart: | compile | 0m 38s | | trunk passed with JDK Private Build-1.8.0_392-8u392-ga-1~20.04-b08 | | +1 :green_heart: | checkstyle | 0m 38s | | trunk passed | | +1 :green_heart: | mvnsite | 0m 43s | | trunk passed | | +1 :green_heart: | javadoc | 0m 39s | | trunk passed with JDK Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04 | | +1 :green_heart: | javadoc | 1m 0s | | trunk passed with JDK Private Build-1.8.0_392-8u392-ga-1~20.04-b08 | | +1 :green_heart: | spotbugs | 1m 42s | | trunk passed | | +1 :green_heart: | shadedclient | 20m 29s | | branch has no errors when building and testing our client artifacts. 
| _ Patch Compile Tests _ | | +1 :green_heart: | mvninstall | 0m 39s | | the patch passed | | +1 :green_heart: | compile | 0m 45s | | the patch passed with JDK Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04 | | +1 :green_heart: | javac | 0m 45s | | the patch passed | | +1 :green_heart: | compile | 0m 32s | | the patch passed with JDK Private Build-1.8.0_392-8u392-ga-1~20.04-b08 | | +1 :green_heart: | javac | 0m 32s | | the patch passed | | +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. | | -0 :warning: | checkstyle | 0m 30s | [/results-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6464/2/artifact/out/results-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt) | hadoop-hdfs-project/hadoop-hdfs: The patch generated 1 new + 87 unchanged - 0 fixed = 88 total (was 87) | | +1 :green_heart: | mvnsite | 0m 37s | | the patch passed | | +1 :green_heart: | javadoc | 0m 29s | | the patch passed with JDK Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04 | | +1 :green_heart: | javadoc | 1m 2s | | the patch passed with JDK Private Build-1.8.0_392-8u392-ga-1~20.04-b08 | | +1 :green_heart: | spotbugs | 1m 40s | | the patch passed | | +1 :green_heart: | shadedclient | 21m 7s | | patch has no errors when building and testing our client artifacts. | _ Other Tests _ | | -1 :x: | unit | 601m 20s | [/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6464/2/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt) | hadoop-hdfs in the patch passed. | | +1 :green_heart: | asflicense | 0m 27s | | The patch does not generate ASF License warnings. 
| | | | 689m 46s | | | | Reason | Tests | |---:|:--| | Failed junit tests | hadoop.hdfs.TestReconstructStripedFileWithRandomECPolicy | | | hadoop.hdfs.TestDFSStripedInputStream | | | hadoop.hdfs.server.datanode.fsdataset.impl.TestFsDatasetImpl | | | hadoop.hdfs.TestParallelShortCircuitReadNoChecksum | | Subsystem | Report/Notes | |--:|:-| | Docker | ClientAPI=1.43 ServerAPI=1.43 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6464/2/artifact/out/Dockerfile | | GITHUB PR | https://github.com/apache/hadoop/pull/6464 | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets | | uname | Linux 30043234e0f6 5.15.0-88-generic #98-Ubuntu SMP Mon Oct 2 15:18:56 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | dev-support/bin/hadoop.sh | | git revision | trunk / 83eab24c7696017a24412340514a6977b6a394af | | Default Java | Private Build-1.8.0_392-8u392-ga-1~20.04-b08 | | Multi-JDK versions |
[jira] [Commented] (HDFS-17339) BPServiceActor should skip cacheReport when one blockPool does not have CacheBlock on this DataNode
[ https://issues.apache.org/jira/browse/HDFS-17339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17808361#comment-17808361 ] ASF GitHub Bot commented on HDFS-17339: --- hadoop-yetus commented on PR #6456: URL: https://github.com/apache/hadoop/pull/6456#issuecomment-1899150670 :broken_heart: **-1 overall** | Vote | Subsystem | Runtime | Logfile | Comment | |::|--:|:|::|:---:| | +0 :ok: | reexec | 0m 34s | | Docker mode activated. | _ Prechecks _ | | +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. | | +0 :ok: | codespell | 0m 0s | | codespell was not available. | | +0 :ok: | detsecrets | 0m 0s | | detect-secrets was not available. | | +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. | | -1 :x: | test4tests | 0m 0s | | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. | _ trunk Compile Tests _ | | +1 :green_heart: | mvninstall | 43m 9s | | trunk passed | | +1 :green_heart: | compile | 1m 19s | | trunk passed with JDK Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04 | | +1 :green_heart: | compile | 1m 10s | | trunk passed with JDK Private Build-1.8.0_392-8u392-ga-1~20.04-b08 | | +1 :green_heart: | checkstyle | 1m 9s | | trunk passed | | +1 :green_heart: | mvnsite | 1m 22s | | trunk passed | | +1 :green_heart: | javadoc | 1m 5s | | trunk passed with JDK Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04 | | +1 :green_heart: | javadoc | 1m 39s | | trunk passed with JDK Private Build-1.8.0_392-8u392-ga-1~20.04-b08 | | +1 :green_heart: | spotbugs | 3m 14s | | trunk passed | | +1 :green_heart: | shadedclient | 34m 23s | | branch has no errors when building and testing our client artifacts. 
| _ Patch Compile Tests _ | | +1 :green_heart: | mvninstall | 1m 9s | | the patch passed | | +1 :green_heart: | compile | 1m 11s | | the patch passed with JDK Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04 | | +1 :green_heart: | javac | 1m 11s | | the patch passed | | +1 :green_heart: | compile | 1m 4s | | the patch passed with JDK Private Build-1.8.0_392-8u392-ga-1~20.04-b08 | | +1 :green_heart: | javac | 1m 4s | | the patch passed | | +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. | | +1 :green_heart: | checkstyle | 0m 55s | | the patch passed | | +1 :green_heart: | mvnsite | 1m 11s | | the patch passed | | +1 :green_heart: | javadoc | 0m 52s | | the patch passed with JDK Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04 | | +1 :green_heart: | javadoc | 1m 30s | | the patch passed with JDK Private Build-1.8.0_392-8u392-ga-1~20.04-b08 | | +1 :green_heart: | spotbugs | 3m 18s | | the patch passed | | +1 :green_heart: | shadedclient | 34m 32s | | patch has no errors when building and testing our client artifacts. | _ Other Tests _ | | -1 :x: | unit | 225m 18s | [/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6456/2/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt) | hadoop-hdfs in the patch passed. | | +1 :green_heart: | asflicense | 0m 41s | | The patch does not generate ASF License warnings. 
| | | | 360m 57s | | | | Reason | Tests | |---:|:--| | Failed junit tests | hadoop.hdfs.server.datanode.TestDirectoryScanner | | Subsystem | Report/Notes | |--:|:-| | Docker | ClientAPI=1.43 ServerAPI=1.43 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6456/2/artifact/out/Dockerfile | | GITHUB PR | https://github.com/apache/hadoop/pull/6456 | | JIRA Issue | HDFS-17339 | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets | | uname | Linux efee4dd1d356 5.15.0-91-generic #101-Ubuntu SMP Tue Nov 14 13:30:08 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | dev-support/bin/hadoop.sh | | git revision | trunk / 243af8bea73098685bbca84a7c22e9e98fedcd57 | | Default Java | Private Build-1.8.0_392-8u392-ga-1~20.04-b08 | | Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_392-8u392-ga-1~20.04-b08 | | Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6456/2/testReport/ | | Max. process+thread count | 4168 (vs. ulimit of 5500) | | modules |
[jira] [Commented] (HDFS-17302) RBF: ProportionRouterRpcFairnessPolicyController-Sharing and isolation.
[ https://issues.apache.org/jira/browse/HDFS-17302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17808352#comment-17808352 ] ASF GitHub Bot commented on HDFS-17302: --- goiri commented on PR #6380: URL: https://github.com/apache/hadoop/pull/6380#issuecomment-1899080274 @huangzhaobo99 do you still have concerns with the approach? > RBF: ProportionRouterRpcFairnessPolicyController-Sharing and isolation. > --- > > Key: HDFS-17302 > URL: https://issues.apache.org/jira/browse/HDFS-17302 > Project: Hadoop HDFS > Issue Type: New Feature > Components: rbf >Reporter: Jian Zhang >Assignee: Jian Zhang >Priority: Major > Labels: pull-request-available > Attachments: HDFS-17302.001.patch, HDFS-17302.002.patch, > HDFS-17302.003.patch > > > h2. Current shortcomings > [HDFS-14090|https://issues.apache.org/jira/browse/HDFS-14090] provides a > StaticRouterRpcFairnessPolicyController to support configuring different > handlers for different ns. Using the StaticRouterRpcFairnessPolicyController > allows the router to isolate different ns, and the ns with a higher load will > not affect the router's access to the ns with a normal load. But the > StaticRouterRpcFairnessPolicyController still falls short in many ways, such > as: > 1. *Configuration is inconvenient and error-prone*: When I use > StaticRouterRpcFairnessPolicyController, I first need to know how many > handlers the router has in total, then I have to know how many nameservices > the router currently has, and then carefully calculate how many handlers to > allocate to each ns so that the sum of handlers for all ns will not exceed > the total handlers of the router, and I also need to consider how many > handlers to allocate to each ns to achieve better performance. Therefore, I > need to be very careful when configuring. 
If I configure even one more > handler for a certain ns than the router owns in total, the router will fail > to start. I then have to investigate why the router failed to start and, > after finding the cause, reconsider the number of handlers for each ns. In > addition, when I reconfigure the total number of handlers on the router, I > have to re-allocate handlers to each ns, which undoubtedly increases the > complexity of operation and maintenance. > 2. *Extending ns is not supported*: While the router is running, if a > new ns is added to the cluster and a mount is added for it, the ns cannot be > accessed through the router because no handlers are allocated for it. We > must reconfigure the number of handlers and then refresh the configuration; > only then can the router access the ns normally. When we reconfigure the > number of handlers, we again face disadvantage 1: configuration is > inconvenient and error-prone. > 3. *Wasted handlers*: The main purpose of proposing the > RouterRpcFairnessPolicyController is to enable the router to access ns with > normal load without being affected by ns with higher load. First, not all > ns have high loads; second, ns with high loads are not highly loaded 24 > hours a day. It may be that only certain periods, such as 0:00 to 8:00, have > high loads while other periods are normal. Assume there are 2 ns and each is > allocated half of the handlers. Assume ns1 has many requests from 0:00 to > 14:00 and almost none from 14:00 to 24:00, while ns2 has many requests from > 12:00 to 24:00 and almost none from 0:00 to 14:00; then between 0:00 and > 12:00 and between 14:00 and 24:00, only one ns has significant traffic while > the other is nearly idle, so half of the handlers are wasted. > 4. 
*Only isolation, no sharing*: The staticRouterRpcFairnessPolicyController > does not support sharing, only isolation. I think isolation is just a means > to improve the performance of router access to normal ns, not the purpose. It > is impossible for all ns in the cluster to have high loads. On the contrary, > in most scenarios, only a few ns in the cluster have high loads, and the > loads of most other ns are normal. For ns with higher load and ns with normal > load, we need to isolate their handlers so that the ns with higher load will > not affect the performance of ns with lower load. However, for nameservices > that are also under normal load, or are under higher load, we do not need to > isolate them, these ns of the same nature can share the handlers of the > router; The performance is better than assigning a fixed number of handlers > to each ns, because each ns can use all the handlers of
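The proportion-based idea in the shortcomings above (per-ns caps rather than a static handler split) can be sketched as follows. This is a minimal illustration, not the HDFS-17302 implementation: the class name, method name, and `<default>` key are hypothetical, and it assumes each ns is configured with a fraction of the total handler count:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of proportion-based handler admission. Each ns may use up to
// ceil(proportion * totalHandlers) handlers concurrently. Unlike a static
// split, proportions may sum to more than 1.0: a hot ns is still capped
// (isolation), but capacity left idle by one ns is usable by another
// (sharing). A ns without an explicit proportion falls back to a default,
// so a newly mounted ns works without reconfiguring every other ns.
public class ProportionCaps {
    static Map<String, Integer> handlerCaps(
            int totalHandlers, Map<String, Double> proportions, double defaultProportion) {
        Map<String, Integer> caps = new LinkedHashMap<>();
        for (Map.Entry<String, Double> e : proportions.entrySet()) {
            caps.put(e.getKey(), (int) Math.ceil(e.getValue() * totalHandlers));
        }
        caps.put("<default>", (int) Math.ceil(defaultProportion * totalHandlers));
        return caps;
    }

    public static void main(String[] args) {
        Map<String, Double> p = new LinkedHashMap<>();
        p.put("ns1", 0.6); // ns1 may use up to 60% of handlers at once
        p.put("ns2", 0.6); // caps deliberately sum past 100%: sharing
        System.out.println(handlerCaps(100, p, 0.1));
        // prints {ns1=60, ns2=60, <default>=10}
    }
}
```

In the day/night example above, this lets ns1 use well over half of the handlers during its 0:00-14:00 peak while ns2 is idle, and vice versa, instead of each being pinned to a fixed half.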
[jira] [Commented] (HDFS-17343) Revert HDFS-16016. BPServiceActor to provide new thread to handle IBR
[ https://issues.apache.org/jira/browse/HDFS-17343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17808289#comment-17808289 ] ASF GitHub Bot commented on HDFS-17343: --- virajjasani commented on PR #6457: URL: https://github.com/apache/hadoop/pull/6457#issuecomment-1898800060 Thanks for working on the revert to unblock 3.4.0! > Revert HDFS-16016. BPServiceActor to provide new thread to handle IBR > - > > Key: HDFS-17343 > URL: https://issues.apache.org/jira/browse/HDFS-17343 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namenode >Affects Versions: 3.4.0 >Reporter: Shilun Fan >Assignee: Shilun Fan >Priority: Major > Labels: pull-request-available > > When preparing for hadoop-3.4.0 release, we found that HDFS-16016 may cause > mis-order of ibr and fbr on datanode. After discussion, we decided to revert > HDFS-16016. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-17342) Fix DataNode may invalidates normal block causing missing block
[ https://issues.apache.org/jira/browse/HDFS-17342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17808272#comment-17808272 ] ASF GitHub Bot commented on HDFS-17342: --- hadoop-yetus commented on PR #6464: URL: https://github.com/apache/hadoop/pull/6464#issuecomment-1898730447 :broken_heart: **-1 overall** | Vote | Subsystem | Runtime | Logfile | Comment | |::|--:|:|::|:---:| | +0 :ok: | reexec | 0m 22s | | Docker mode activated. | _ Prechecks _ | | +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. | | +0 :ok: | codespell | 0m 0s | | codespell was not available. | | +0 :ok: | detsecrets | 0m 0s | | detect-secrets was not available. | | +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. | | +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 1 new or modified test files. | _ trunk Compile Tests _ | | +1 :green_heart: | mvninstall | 34m 14s | | trunk passed | | +1 :green_heart: | compile | 0m 44s | | trunk passed with JDK Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04 | | +1 :green_heart: | compile | 0m 41s | | trunk passed with JDK Private Build-1.8.0_392-8u392-ga-1~20.04-b08 | | +1 :green_heart: | checkstyle | 0m 40s | | trunk passed | | +1 :green_heart: | mvnsite | 0m 46s | | trunk passed | | +1 :green_heart: | javadoc | 0m 43s | | trunk passed with JDK Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04 | | +1 :green_heart: | javadoc | 1m 5s | | trunk passed with JDK Private Build-1.8.0_392-8u392-ga-1~20.04-b08 | | +1 :green_heart: | spotbugs | 1m 58s | | trunk passed | | +1 :green_heart: | shadedclient | 22m 48s | | branch has no errors when building and testing our client artifacts. 
| _ Patch Compile Tests _ | | +1 :green_heart: | mvninstall | 0m 41s | | the patch passed | | +1 :green_heart: | compile | 0m 39s | | the patch passed with JDK Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04 | | +1 :green_heart: | javac | 0m 39s | | the patch passed | | +1 :green_heart: | compile | 0m 37s | | the patch passed with JDK Private Build-1.8.0_392-8u392-ga-1~20.04-b08 | | +1 :green_heart: | javac | 0m 37s | | the patch passed | | +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. | | -0 :warning: | checkstyle | 0m 32s | [/results-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6464/3/artifact/out/results-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt) | hadoop-hdfs-project/hadoop-hdfs: The patch generated 1 new + 87 unchanged - 0 fixed = 88 total (was 87) | | +1 :green_heart: | mvnsite | 0m 41s | | the patch passed | | +1 :green_heart: | javadoc | 0m 31s | | the patch passed with JDK Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04 | | +1 :green_heart: | javadoc | 1m 3s | | the patch passed with JDK Private Build-1.8.0_392-8u392-ga-1~20.04-b08 | | +1 :green_heart: | spotbugs | 1m 53s | | the patch passed | | +1 :green_heart: | shadedclient | 23m 19s | | patch has no errors when building and testing our client artifacts. | _ Other Tests _ | | -1 :x: | unit | 211m 24s | [/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6464/3/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt) | hadoop-hdfs in the patch passed. | | +1 :green_heart: | asflicense | 0m 21s | | The patch does not generate ASF License warnings. 
| | | | 306m 44s | | | | Reason | Tests | |---:|:--| | Failed junit tests | hadoop.hdfs.TestDFSStripedInputStream | | | hadoop.hdfs.TestEncryptionZonesWithKMS | | | hadoop.hdfs.TestReconstructStripedFile | | | hadoop.hdfs.server.namenode.snapshot.TestRenameWithSnapshots | | | hadoop.hdfs.server.datanode.TestDirectoryScanner | | | hadoop.hdfs.server.namenode.TestReconstructStripedBlocks | | | hadoop.hdfs.TestDFSStripedOutputStreamWithRandomECPolicy | | Subsystem | Report/Notes | |--:|:-| | Docker | ClientAPI=1.43 ServerAPI=1.43 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6464/3/artifact/out/Dockerfile | | GITHUB PR | https://github.com/apache/hadoop/pull/6464 | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets | | uname | Linux 0ea7c919d276 5.15.0-88-generic #98-Ubuntu SMP Mon Oct 2 15:18:56 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | dev-support/bin/hadoop.sh | | git revision | trunk /
[jira] [Resolved] (HDFS-17331) Fix Blocks are always -1 and DataNode`s version are always UNKNOWN in federationhealth.html
[ https://issues.apache.org/jira/browse/HDFS-17331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shuyan Zhang resolved HDFS-17331. - Fix Version/s: 3.5.0 Hadoop Flags: Reviewed Target Version/s: 3.5.0 Assignee: lei w Resolution: Fixed > Fix Blocks are always -1 and DataNode`s version are always UNKNOWN in > federationhealth.html > --- > > Key: HDFS-17331 > URL: https://issues.apache.org/jira/browse/HDFS-17331 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: lei w >Assignee: lei w >Priority: Major > Labels: pull-request-available > Fix For: 3.5.0 > > Attachments: After fix.png, Before fix.png > > > Blocks are always -1 and DataNode`s version are always UNKNOWN in > federationhealth.html -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-17331) Fix Blocks are always -1 and DataNode`s version are always UNKNOWN in federationhealth.html
[ https://issues.apache.org/jira/browse/HDFS-17331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17808194#comment-17808194 ] ASF GitHub Bot commented on HDFS-17331: --- zhangshuyan0 merged PR #6429: URL: https://github.com/apache/hadoop/pull/6429 > Fix Blocks are always -1 and DataNode`s version are always UNKNOWN in > federationhealth.html > --- > > Key: HDFS-17331 > URL: https://issues.apache.org/jira/browse/HDFS-17331 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: lei w >Priority: Major > Labels: pull-request-available > Attachments: After fix.png, Before fix.png > > > Blocks are always -1 and DataNode`s version are always UNKNOWN in > federationhealth.html -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-17331) Fix Blocks are always -1 and DataNode`s version are always UNKNOWN in federationhealth.html
[ https://issues.apache.org/jira/browse/HDFS-17331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17808185#comment-17808185 ] ASF GitHub Bot commented on HDFS-17331: --- hadoop-yetus commented on PR #6429: URL: https://github.com/apache/hadoop/pull/6429#issuecomment-1898411792 :broken_heart: **-1 overall** | Vote | Subsystem | Runtime | Logfile | Comment | |::|--:|:|::|:---:| | +0 :ok: | reexec | 0m 21s | | Docker mode activated. | _ Prechecks _ | | +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. | | +0 :ok: | codespell | 0m 0s | | codespell was not available. | | +0 :ok: | detsecrets | 0m 0s | | detect-secrets was not available. | | +0 :ok: | buf | 0m 0s | | buf was not available. | | +0 :ok: | buf | 0m 0s | | buf was not available. | | +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. | | -1 :x: | test4tests | 0m 0s | | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. 
| _ trunk Compile Tests _ | | +0 :ok: | mvndep | 13m 44s | | Maven dependency ordering for branch | | +1 :green_heart: | mvninstall | 19m 17s | | trunk passed | | +1 :green_heart: | compile | 2m 53s | | trunk passed with JDK Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04 | | +1 :green_heart: | compile | 2m 46s | | trunk passed with JDK Private Build-1.8.0_392-8u392-ga-1~20.04-b08 | | +1 :green_heart: | checkstyle | 0m 44s | | trunk passed | | +1 :green_heart: | mvnsite | 0m 58s | | trunk passed | | +1 :green_heart: | javadoc | 0m 53s | | trunk passed with JDK Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04 | | +1 :green_heart: | javadoc | 0m 48s | | trunk passed with JDK Private Build-1.8.0_392-8u392-ga-1~20.04-b08 | | +1 :green_heart: | spotbugs | 2m 11s | | trunk passed | | +1 :green_heart: | shadedclient | 19m 32s | | branch has no errors when building and testing our client artifacts. | _ Patch Compile Tests _ | | +0 :ok: | mvndep | 0m 20s | | Maven dependency ordering for patch | | +1 :green_heart: | mvninstall | 0m 44s | | the patch passed | | +1 :green_heart: | compile | 2m 45s | | the patch passed with JDK Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04 | | +1 :green_heart: | cc | 2m 45s | | the patch passed | | +1 :green_heart: | javac | 2m 45s | | the patch passed | | +1 :green_heart: | compile | 2m 42s | | the patch passed with JDK Private Build-1.8.0_392-8u392-ga-1~20.04-b08 | | +1 :green_heart: | cc | 2m 42s | | the patch passed | | +1 :green_heart: | javac | 2m 42s | | the patch passed | | +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. 
| | +1 :green_heart: | checkstyle | 0m 37s | | the patch passed | | +1 :green_heart: | mvnsite | 0m 46s | | the patch passed | | +1 :green_heart: | javadoc | 0m 36s | | the patch passed with JDK Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04 | | +1 :green_heart: | javadoc | 0m 41s | | the patch passed with JDK Private Build-1.8.0_392-8u392-ga-1~20.04-b08 | | +1 :green_heart: | spotbugs | 2m 14s | | the patch passed | | +1 :green_heart: | shadedclient | 19m 34s | | patch has no errors when building and testing our client artifacts. | _ Other Tests _ | | +1 :green_heart: | unit | 1m 49s | | hadoop-hdfs-client in the patch passed. | | +1 :green_heart: | unit | 18m 59s | | hadoop-hdfs-rbf in the patch passed. | | +1 :green_heart: | asflicense | 0m 25s | | The patch does not generate ASF License warnings. | | | | 118m 59s | | | | Subsystem | Report/Notes | |--:|:-| | Docker | ClientAPI=1.43 ServerAPI=1.43 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6429/9/artifact/out/Dockerfile | | GITHUB PR | https://github.com/apache/hadoop/pull/6429 | | JIRA Issue | HDFS-17331 | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets cc buflint bufcompat | | uname | Linux b6670bf98a38 5.15.0-88-generic #98-Ubuntu SMP Mon Oct 2 15:18:56 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | dev-support/bin/hadoop.sh | | git revision | trunk / 6ee5a462a0abb05345f2bd3fbe71ca2e4bb54569 | | Default Java | Private Build-1.8.0_392-8u392-ga-1~20.04-b08 | | Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04
[jira] [Commented] (HDFS-17332) DFSInputStream: avoid logging stacktrace until when we really need to fail a read request with a MissingBlockException
[ https://issues.apache.org/jira/browse/HDFS-17332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17808146#comment-17808146 ] ASF GitHub Bot commented on HDFS-17332: --- hadoop-yetus commented on PR #6446: URL: https://github.com/apache/hadoop/pull/6446#issuecomment-1898272516 :confetti_ball: **+1 overall** | Vote | Subsystem | Runtime | Logfile | Comment | |::|--:|:|::|:---:| | +0 :ok: | reexec | 0m 49s | | Docker mode activated. | _ Prechecks _ | | +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. | | +0 :ok: | codespell | 0m 1s | | codespell was not available. | | +0 :ok: | detsecrets | 0m 1s | | detect-secrets was not available. | | +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. | | +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 2 new or modified test files. | _ trunk Compile Tests _ | | +0 :ok: | mvndep | 13m 49s | | Maven dependency ordering for branch | | +1 :green_heart: | mvninstall | 35m 28s | | trunk passed | | +1 :green_heart: | compile | 6m 12s | | trunk passed with JDK Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04 | | +1 :green_heart: | compile | 5m 46s | | trunk passed with JDK Private Build-1.8.0_392-8u392-ga-1~20.04-b08 | | +1 :green_heart: | checkstyle | 1m 26s | | trunk passed | | +1 :green_heart: | mvnsite | 2m 19s | | trunk passed | | +1 :green_heart: | javadoc | 1m 53s | | trunk passed with JDK Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04 | | +1 :green_heart: | javadoc | 2m 20s | | trunk passed with JDK Private Build-1.8.0_392-8u392-ga-1~20.04-b08 | | +1 :green_heart: | spotbugs | 5m 58s | | trunk passed | | +1 :green_heart: | shadedclient | 39m 45s | | branch has no errors when building and testing our client artifacts. 
| _ Patch Compile Tests _ | | +0 :ok: | mvndep | 0m 31s | | Maven dependency ordering for patch | | +1 :green_heart: | mvninstall | 2m 2s | | the patch passed | | +1 :green_heart: | compile | 5m 54s | | the patch passed with JDK Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04 | | +1 :green_heart: | javac | 5m 54s | | the patch passed | | +1 :green_heart: | compile | 5m 45s | | the patch passed with JDK Private Build-1.8.0_392-8u392-ga-1~20.04-b08 | | +1 :green_heart: | javac | 5m 45s | | the patch passed | | +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. | | +1 :green_heart: | checkstyle | 1m 19s | | hadoop-hdfs-project: The patch generated 0 new + 43 unchanged - 1 fixed = 43 total (was 44) | | +1 :green_heart: | mvnsite | 2m 2s | | the patch passed | | +1 :green_heart: | javadoc | 1m 32s | | the patch passed with JDK Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04 | | +1 :green_heart: | javadoc | 2m 5s | | the patch passed with JDK Private Build-1.8.0_392-8u392-ga-1~20.04-b08 | | +1 :green_heart: | spotbugs | 6m 1s | | the patch passed | | +1 :green_heart: | shadedclient | 39m 45s | | patch has no errors when building and testing our client artifacts. | _ Other Tests _ | | +1 :green_heart: | unit | 2m 24s | | hadoop-hdfs-client in the patch passed. | | +1 :green_heart: | unit | 254m 38s | | hadoop-hdfs in the patch passed. | | +1 :green_heart: | asflicense | 0m 45s | | The patch does not generate ASF License warnings. 
| | | | 441m 21s | | | | Subsystem | Report/Notes | |--:|:-| | Docker | ClientAPI=1.43 ServerAPI=1.43 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6446/10/artifact/out/Dockerfile | | GITHUB PR | https://github.com/apache/hadoop/pull/6446 | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets | | uname | Linux 4426bef6a3ba 5.15.0-88-generic #98-Ubuntu SMP Mon Oct 2 15:18:56 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | dev-support/bin/hadoop.sh | | git revision | trunk / 22d136773a102e3f317e6b785cc9c15f41308f4d | | Default Java | Private Build-1.8.0_392-8u392-ga-1~20.04-b08 | | Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_392-8u392-ga-1~20.04-b08 | | Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6446/10/testReport/ | | Max. process+thread count | 2189 (vs. ulimit of 5500) | | modules | C: hadoop-hdfs-project/hadoop-hdfs-client hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project | | Console output |
[jira] [Commented] (HDFS-17342) Fix DataNode may invalidates normal block causing missing block
[ https://issues.apache.org/jira/browse/HDFS-17342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17808106#comment-17808106 ] ASF GitHub Bot commented on HDFS-17342: --- hadoop-yetus commented on PR #6464: URL: https://github.com/apache/hadoop/pull/6464#issuecomment-1898114627 :broken_heart: **-1 overall** | Vote | Subsystem | Runtime | Logfile | Comment | |::|--:|:|::|:---:| | +0 :ok: | reexec | 0m 25s | | Docker mode activated. | _ Prechecks _ | | +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. | | +0 :ok: | codespell | 0m 0s | | codespell was not available. | | +0 :ok: | detsecrets | 0m 0s | | detect-secrets was not available. | | +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. | | +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 1 new or modified test files. | _ trunk Compile Tests _ | | -1 :x: | mvninstall | 0m 21s | [/branch-mvninstall-root.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6464/1/artifact/out/branch-mvninstall-root.txt) | root in trunk failed. | | -1 :x: | compile | 0m 21s | [/branch-compile-hadoop-hdfs-project_hadoop-hdfs-jdkUbuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6464/1/artifact/out/branch-compile-hadoop-hdfs-project_hadoop-hdfs-jdkUbuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04.txt) | hadoop-hdfs in trunk failed with JDK Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04. | | -1 :x: | compile | 0m 21s | [/branch-compile-hadoop-hdfs-project_hadoop-hdfs-jdkPrivateBuild-1.8.0_392-8u392-ga-1~20.04-b08.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6464/1/artifact/out/branch-compile-hadoop-hdfs-project_hadoop-hdfs-jdkPrivateBuild-1.8.0_392-8u392-ga-1~20.04-b08.txt) | hadoop-hdfs in trunk failed with JDK Private Build-1.8.0_392-8u392-ga-1~20.04-b08. 
| | -0 :warning: | checkstyle | 0m 20s | [/buildtool-branch-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6464/1/artifact/out/buildtool-branch-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt) | The patch fails to run checkstyle in hadoop-hdfs | | -1 :x: | mvnsite | 0m 21s | [/branch-mvnsite-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6464/1/artifact/out/branch-mvnsite-hadoop-hdfs-project_hadoop-hdfs.txt) | hadoop-hdfs in trunk failed. | | -1 :x: | javadoc | 0m 27s | [/branch-javadoc-hadoop-hdfs-project_hadoop-hdfs-jdkUbuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6464/1/artifact/out/branch-javadoc-hadoop-hdfs-project_hadoop-hdfs-jdkUbuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04.txt) | hadoop-hdfs in trunk failed with JDK Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04. | | -1 :x: | javadoc | 3m 21s | [/branch-javadoc-hadoop-hdfs-project_hadoop-hdfs-jdkPrivateBuild-1.8.0_392-8u392-ga-1~20.04-b08.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6464/1/artifact/out/branch-javadoc-hadoop-hdfs-project_hadoop-hdfs-jdkPrivateBuild-1.8.0_392-8u392-ga-1~20.04-b08.txt) | hadoop-hdfs in trunk failed with JDK Private Build-1.8.0_392-8u392-ga-1~20.04-b08. | | -1 :x: | spotbugs | 0m 21s | [/branch-spotbugs-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6464/1/artifact/out/branch-spotbugs-hadoop-hdfs-project_hadoop-hdfs.txt) | hadoop-hdfs in trunk failed. | | +1 :green_heart: | shadedclient | 5m 33s | | branch has no errors when building and testing our client artifacts. | _ Patch Compile Tests _ | | -1 :x: | mvninstall | 0m 22s | [/patch-mvninstall-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6464/1/artifact/out/patch-mvninstall-hadoop-hdfs-project_hadoop-hdfs.txt) | hadoop-hdfs in the patch failed. 
| | -1 :x: | compile | 0m 20s | [/patch-compile-hadoop-hdfs-project_hadoop-hdfs-jdkUbuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6464/1/artifact/out/patch-compile-hadoop-hdfs-project_hadoop-hdfs-jdkUbuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04.txt) | hadoop-hdfs in the patch failed with JDK Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04. | | -1 :x: | javac | 0m 20s | [/patch-compile-hadoop-hdfs-project_hadoop-hdfs-jdkUbuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6464/1/artifact/out/patch-compile-hadoop-hdfs-project_hadoop-hdfs-jdkUbuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04.txt) | hadoop-hdfs in the patch failed with JDK
[jira] [Updated] (HDFS-17342) Fix DataNode may invalidates normal block causing missing block
[ https://issues.apache.org/jira/browse/HDFS-17342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HDFS-17342: -- Labels: pull-request-available (was: ) > Fix DataNode may invalidates normal block causing missing block > --- > > Key: HDFS-17342 > URL: https://issues.apache.org/jira/browse/HDFS-17342 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: Haiyang Hu >Assignee: Haiyang Hu >Priority: Major > Labels: pull-request-available > > When users read an append file, occasional exceptions may occur, such as > org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block: xxx. > This can happen if one thread is reading the block while the writer thread is > finalizing it simultaneously. > *Root cause:* > # The reader thread obtains an RBW replica from VolumeMap, such as > blk_xxx_xxx[RBW], whose data file should be in /XXX/rbw/blk_xxx. > # Simultaneously, the writer thread finalizes this block, moving it from > the RBW directory to the FINALIZE directory: the data file is moved from > /XXX/rbw/block_xxx to /XXX/finalize/block_xxx. > # The reader thread attempts to open the data input stream but encounters a > FileNotFoundException because the data file /XXX/rbw/blk_xxx or meta file > /XXX/rbw/blk_xxx_xxx doesn't exist at this moment. > # The reader thread then treats this block as corrupt, removes the replica > from the volume map, and the DataNode reports the deleted block to the > NameNode. > # The NameNode removes this replica of the block. > # If the current file replication is 1, this causes a missing block > issue until this DataNode runs the DirectoryScanner again. > As described above, the FileNotFoundException encountered by the reader thread > is expected, because the file has been moved. > So we need to add a double-check to the invalidateMissingBlock logic that > verifies whether the data file or meta file exists, to avoid similar cases. 
[jira] [Commented] (HDFS-17342) Fix DataNode may invalidates normal block causing missing block
[ https://issues.apache.org/jira/browse/HDFS-17342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17808090#comment-17808090 ] ASF GitHub Bot commented on HDFS-17342: --- haiyang1987 opened a new pull request, #6464: URL: https://github.com/apache/hadoop/pull/6464 ### Description of PR https://issues.apache.org/jira/browse/HDFS-17342 When users read an append file, occasional exceptions may occur, such as org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block: xxx. This can happen if one thread is reading the block while the writer thread is finalizing it simultaneously. **Root cause:** 1. The reader thread obtains an RBW replica from VolumeMap, such as blk_xxx_xxx[RBW], whose data file should be in /XXX/rbw/blk_xxx. 2. Simultaneously, the writer thread finalizes this block, moving it from the RBW directory to the FINALIZE directory: the data file is moved from /XXX/rbw/block_xxx to /XXX/finalize/block_xxx. 3. The reader thread attempts to open the data input stream but encounters a FileNotFoundException because the data file /XXX/rbw/blk_xxx or meta file /XXX/rbw/blk_xxx_xxx doesn't exist at this moment. 4. The reader thread then treats this block as corrupt, removes the replica from the volume map, and the DataNode reports the deleted block to the NameNode. 5. The NameNode removes this replica of the block. 6. If the current file replication is 1, this causes a missing block issue until this DataNode runs the DirectoryScanner again. As described above, the FileNotFoundException encountered by the reader thread is expected, because the file has been moved. So we need to add a double-check to the invalidateMissingBlock logic that verifies whether the data file or meta file exists, to avoid similar cases. 
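The double-check described in this thread can be illustrated with a small, self-contained sketch: only invalidate a replica when both its data file and meta file are really absent, since a FileNotFoundException raised while the writer is finalizing only means the files were moved between directories. All names here (InvalidateSketch, ReplicaSketch, shouldInvalidate) are illustrative assumptions, not the actual HDFS-17342 patch.

```java
import java.io.File;
import java.io.IOException;

public class InvalidateSketch {
    // Minimal stand-in for a replica record: its on-disk data and meta files.
    // Illustrative only; not the real ReplicaInfo class.
    static class ReplicaSketch {
        final File dataFile;
        final File metaFile;
        ReplicaSketch(File dataFile, File metaFile) {
            this.dataFile = dataFile;
            this.metaFile = metaFile;
        }
    }

    /**
     * Double-check before invalidating: returns true only when both the data
     * file and the meta file are missing. A finalize that merely moved the
     * files must not cause the replica to be reported as deleted.
     */
    static boolean shouldInvalidate(ReplicaSketch replica) {
        return !replica.dataFile.exists() && !replica.metaFile.exists();
    }

    public static void main(String[] args) throws IOException {
        File data = File.createTempFile("blk_", "");
        File meta = File.createTempFile("blk_", ".meta");
        ReplicaSketch replica = new ReplicaSketch(data, meta);
        // Files still exist on disk: do not invalidate the replica.
        System.out.println(shouldInvalidate(replica)); // prints false
        data.delete();
        meta.delete();
        // Both files gone: invalidation is justified.
        System.out.println(shouldInvalidate(replica)); // prints true
    }
}
```

The point of the check is ordering: re-verify file existence at the moment of invalidation, not at the moment the open failed, because the two can straddle a concurrent finalize.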
[jira] [Resolved] (HDFS-17334) FSEditLogAsync#enqueueEdit does not synchronized this before invoke wait method
[ https://issues.apache.org/jira/browse/HDFS-17334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] farmmamba resolved HDFS-17334. -- Resolution: Not A Problem > FSEditLogAsync#enqueueEdit does not synchronized this before invoke wait > method > --- > > Key: HDFS-17334 > URL: https://issues.apache.org/jira/browse/HDFS-17334 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Affects Versions: 3.3.6 >Reporter: farmmamba >Assignee: farmmamba >Priority: Major > Labels: pull-request-available > Fix For: 3.5.0 > > > In the method FSEditLogAsync#enqueueEdit, the following code exists: > {code:java} > if (Thread.holdsLock(this)) { > // if queue is full, synchronized caller must immediately relinquish > // the monitor before re-offering to avoid deadlock with sync thread > // which needs the monitor to write transactions. > int permits = overflowMutex.drainPermits(); > try { > do { > this.wait(1000); // will be notified by next logSync. > } while (!editPendingQ.offer(edit)); > } finally { > overflowMutex.release(permits); > } > } {code} > It may invoke this.wait(1000) without holding this object's monitor. >
[jira] [Commented] (HDFS-17334) FSEditLogAsync#enqueueEdit does not synchronized this before invoke wait method
[ https://issues.apache.org/jira/browse/HDFS-17334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17808077#comment-17808077 ] ASF GitHub Bot commented on HDFS-17334: --- hfutatzhanghb closed pull request #6434: HDFS-17334. FSEditLogAsync#enqueueEdit does not synchronized this before invoke wait method. URL: https://github.com/apache/hadoop/pull/6434
[jira] [Commented] (HDFS-17334) FSEditLogAsync#enqueueEdit does not synchronized this before invoke wait method
[ https://issues.apache.org/jira/browse/HDFS-17334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17808076#comment-17808076 ] ASF GitHub Bot commented on HDFS-17334: --- hfutatzhanghb commented on PR #6434: URL: https://github.com/apache/hadoop/pull/6434#issuecomment-1898012100 > > > Line211 has already ensured that we have a monitor for this object: > > > https://github.com/apache/hadoop/blob/ba6ada73acc2bce560878272c543534c21c76f22/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSEditLogAsync.java#L211-L223 > > > > > > So, I think the description in this PR is not a problem. What's your opinion? @hfutatzhanghb > > > > > > @zhangshuyan0 Sir, `this.wait(1000);` is in a do-while loop; when we invoke `this.wait(1000)` the first time, it will release the object monitor. But in an extreme situation, it would throw an exception when invoking `this.wait(1000)` the second time, because the current thread does not hold the object monitor. Waiting for your response~ > > Let's see the [java doc](https://docs.oracle.com/javase/8/docs/api/java/lang/Object.html#wait-long-) : > > > Thus, on return from the wait method, the synchronization state of the object and of thread T is exactly as it was when the wait method was invoked. > > Therefore, after `this.wait(1000)` returns the first time, it obtains the monitor again. I think no exception will be thrown here. By the way, in this [java doc](https://docs.oracle.com/javase/8/docs/api/java/lang/Object.html#wait-long-) , `synchronized -> while loop` is shown as a recommended usage. Looking forward to your response. Sir, thanks a lot for your explanations here. I will close this PR later. Thanks again. 
[jira] [Commented] (HDFS-17311) RBF: ConnectionManager creatorQueue should offer a pool that is not already in creatorQueue.
[ https://issues.apache.org/jira/browse/HDFS-17311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17808074#comment-17808074 ] ASF GitHub Bot commented on HDFS-17311: --- LiuGuH commented on PR #6392: URL: https://github.com/apache/hadoop/pull/6392#issuecomment-1898009075 > LGTM @slfan1989 any further comments? @slfan1989 , Hello sir, any further comments? Thanks. > RBF: ConnectionManager creatorQueue should offer a pool that is not already > in creatorQueue. > > > Key: HDFS-17311 > URL: https://issues.apache.org/jira/browse/HDFS-17311 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: liuguanghua >Assignee: liuguanghua >Priority: Major > Labels: pull-request-available > > In the Router, we found the below log: > > 2023-12-29 15:18:54,799 ERROR > org.apache.hadoop.hdfs.server.federation.router.ConnectionManager: Cannot add > more than 2048 connections at the same time > > The log indicates that ConnectionManager.creatorQueue was full at a certain > point. But my cluster does not have so many users that it could reach 2048 pairs of > . > This may be due to the following reasons: > # ConnectionManager.creatorQueue is a queue that is offered a > ConnectionPool when its ConnectionContexts are not enough. > # The ConnectionCreator thread consumes from creatorQueue and creates more > ConnectionContexts for a ConnectionPool. > # Clients may concurrently invoke ConnectionManager.getConnection() for the > same user, which may add many copies of the same ConnectionPool to > ConnectionManager.creatorQueue. > # When creatorQueue is full, a new ConnectionPool cannot be added and this > error is logged. This may prevent a genuinely new > ConnectionPool from producing more ConnectionContexts for a new user. > So this PR tries to ensure creatorQueue does not add the same ConnectionPool twice. 
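The de-duplication idea in the HDFS-17311 thread can be sketched as a bounded queue guarded by a pending set, so that concurrent getConnection() calls cannot enqueue the same pool twice. Class and method names below (CreatorQueueSketch, offerUnique, takePool) are illustrative assumptions, not the actual Router implementation, and the pool is modeled as a plain String key.

```java
import java.util.Set;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;

public class CreatorQueueSketch {
    private final BlockingQueue<String> creatorQueue;
    // Tracks pools currently waiting in creatorQueue, so a pool that is
    // already pending is never offered a second time.
    private final Set<String> pending = ConcurrentHashMap.newKeySet();

    public CreatorQueueSketch(int capacity) {
        this.creatorQueue = new ArrayBlockingQueue<>(capacity);
    }

    /** Offer the pool only if it is not already queued; returns false for duplicates. */
    public boolean offerUnique(String pool) {
        if (!pending.add(pool)) {
            return false; // already pending; skip the duplicate offer
        }
        if (!creatorQueue.offer(pool)) {
            pending.remove(pool); // queue full; undo the reservation
            return false;
        }
        return true;
    }

    /** Consumer side (creator thread): take the pool and clear its pending mark. */
    public String takePool() throws InterruptedException {
        String pool = creatorQueue.take();
        pending.remove(pool);
        return pool;
    }
}
```

With this guard, a burst of identical requests occupies one queue slot instead of many, so the "Cannot add more than 2048 connections" condition cannot be triggered by duplicates of a single pool.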
[jira] [Commented] (HDFS-17334) FSEditLogAsync#enqueueEdit does not synchronized this before invoke wait method
[ https://issues.apache.org/jira/browse/HDFS-17334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17808062#comment-17808062 ] ASF GitHub Bot commented on HDFS-17334: --- zhangshuyan0 commented on PR #6434: URL: https://github.com/apache/hadoop/pull/6434#issuecomment-1897981697 > > Line211 has already ensured that we have a monitor for this object: > > https://github.com/apache/hadoop/blob/ba6ada73acc2bce560878272c543534c21c76f22/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSEditLogAsync.java#L211-L223 > > > > So, I think the description in this PR is not a problem. What's your opinion? @hfutatzhanghb > > @zhangshuyan0 Sir, `this.wait(1000);` is in a do-while loop; when we invoke `this.wait(1000)` the first time, it will release the object monitor. But in an extreme situation, it would throw an exception when invoking `this.wait(1000)` the second time, because the current thread does not hold the object monitor. Waiting for your response~ Let's see the [JAVA doc](https://docs.oracle.com/javase/8/docs/api/java/lang/Object.html#wait-long-) : > Thus, on return from the wait method, the synchronization state of the object and of thread T is exactly as it was when the wait method was invoked. Therefore, after `this.wait(1000)` returns the first time, it obtains the monitor again. I think no exception will be thrown here. By the way, in this [JAVA doc](https://docs.oracle.com/javase/8/docs/api/java/lang/Object.html#wait-long-) , `synchronized -> while loop` is shown as a recommended usage. Looking forward to your response. 
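The Object.wait contract discussed in this thread — the monitor is released during the wait and held again on every return, so a timed wait inside a loop cannot hit IllegalMonitorStateException on the second iteration — can be checked with a standalone snippet. It mirrors the timed-wait loop in enqueueEdit but is not FSEditLogAsync code.

```java
public class WaitLoopSketch {
    /**
     * Runs a timed wait() in a loop inside a synchronized block; returns true
     * iff the monitor is held again after every return from wait(), as the
     * Object.wait Javadoc promises.
     */
    static boolean monitorHeldAfterWait() throws InterruptedException {
        final Object lock = new Object();
        final boolean[] held = {true};
        Thread waiter = new Thread(() -> {
            synchronized (lock) {
                for (int round = 0; round < 3; round++) {
                    try {
                        lock.wait(50); // releases the monitor while waiting
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                        return;
                    }
                    // On return from wait() the thread holds the monitor again,
                    // so the next wait() call in the loop is legal.
                    held[0] &= Thread.holdsLock(lock);
                }
            }
        });
        waiter.start();
        waiter.join();
        return held[0];
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(monitorHeldAfterWait()); // prints true
    }
}
```

This is why the do/while in enqueueEdit is safe and the issue was resolved as "Not A Problem": each wait(1000) both releases and reacquires the monitor, exactly matching the `synchronized` + loop pattern the Javadoc recommends.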