[jira] [Created] (HDFS-16268) Balancer stuck when moving striped blocks due to NPE
Leon Gao created HDFS-16268:
---
Summary: Balancer stuck when moving striped blocks due to NPE
Key: HDFS-16268
URL: https://issues.apache.org/jira/browse/HDFS-16268
Project: Hadoop HDFS
Issue Type: Bug
Components: balancer & mover, erasure-coding
Affects Versions: 3.2.2
Reporter: Leon Gao
Assignee: Leon Gao

{code:java}
21/10/11 06:11:26 WARN balancer.Dispatcher: Dispatcher thread failed
java.lang.NullPointerException
    at org.apache.hadoop.hdfs.server.balancer.Dispatcher$PendingMove.markMovedIfGoodBlock(Dispatcher.java:289)
    at org.apache.hadoop.hdfs.server.balancer.Dispatcher$PendingMove.chooseBlockAndProxy(Dispatcher.java:272)
    at org.apache.hadoop.hdfs.server.balancer.Dispatcher$PendingMove.access$2500(Dispatcher.java:236)
    at org.apache.hadoop.hdfs.server.balancer.Dispatcher$Source.chooseNextMove(Dispatcher.java:899)
    at org.apache.hadoop.hdfs.server.balancer.Dispatcher$Source.dispatchBlocks(Dispatcher.java:958)
    at org.apache.hadoop.hdfs.server.balancer.Dispatcher$Source.access$3300(Dispatcher.java:757)
    at org.apache.hadoop.hdfs.server.balancer.Dispatcher$2.run(Dispatcher.java:1226)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
{code}

Because the NPE is thrown partway through dispatching, pending moves are left in the queue and the balancer stays stuck forever.

--
This message was sent by Atlassian Jira (v8.3.4#803005)
-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
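The failure mode reported in HDFS-16268 (an exception thrown after a move has been registered as pending) can be illustrated with a small sketch. This is not the Dispatcher code and not the actual fix; the class, the `pending` set and the `locations` map are hypothetical stand-ins. It only shows the general idea of guarding the null lookup and cleaning up the pending entry in a `finally` block so nothing is left behind.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class PendingMoveSketch {
    // Hypothetical stand-in for the set of moves the dispatcher considers in flight.
    static final Set<String> pending = ConcurrentHashMap.newKeySet();

    // Tries to schedule a move for blockId. 'locations' maps block id -> proxy
    // datanode and may have no entry for the block (the null case).
    static boolean tryMove(String blockId, Map<String, String> locations) {
        pending.add(blockId);
        boolean scheduled = false;
        try {
            String proxy = locations.get(blockId);
            if (proxy == null) {
                return false; // guard the lookup instead of dereferencing null
            }
            scheduled = true;
            return true;
        } finally {
            // Runs on normal return and on any exception, so a failed attempt
            // never leaves a stale entry in the pending set.
            if (!scheduled) {
                pending.remove(blockId);
            }
        }
    }
}
```

With this shape, a block that cannot be scheduled is removed from the pending set whether the attempt ends by returning false or by throwing.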
[jira] [Created] (HDFS-16224) testBalancerWithObserverWithFailedNode times out
Leon Gao created HDFS-16224:
---
Summary: testBalancerWithObserverWithFailedNode times out
Key: HDFS-16224
URL: https://issues.apache.org/jira/browse/HDFS-16224
Project: Hadoop HDFS
Issue Type: Test
Components: test
Reporter: Leon Gao
Assignee: Leon Gao

testBalancerWithObserverWithFailedNode fails intermittently. It seems the datanodes cannot shut down because we have to wait for them to finish their retries against the failed observer.
[jira] [Created] (HDFS-16188) Router to support resolving monitored namenodes with DNS
Leon Gao created HDFS-16188:
---
Summary: Router to support resolving monitored namenodes with DNS
Key: HDFS-16188
URL: https://issues.apache.org/jira/browse/HDFS-16188
Project: Hadoop HDFS
Issue Type: Improvement
Components: rbf
Environment: We can use a DNS round-robin record to configure the list of monitored namenodes, so we don't have to reconfigure every time a namenode hostname changes. For example, in a containerized environment the hostnames of namenodes/observers can change quite often.
Reporter: Leon Gao
Assignee: Leon Gao
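The DNS-based proposals in this thread (HDFS-16188, HDFS-16157, HDFS-15785) all rely on one round-robin record expanding to several addresses. A minimal sketch of that expansion, using only the JDK's `InetAddress.getAllByName`, is below. The class and method names are hypothetical, and this is not how the Router, JournalNode or DataNode code is actually structured.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.ArrayList;
import java.util.List;

public class DnsHostListSketch {
    // Expands one DNS name into "ip:port" entries, one per address record.
    // Note: IPv6 literals would need brackets; that is ignored in this sketch.
    static List<String> resolveAll(String dnsName, int port) {
        List<String> endpoints = new ArrayList<>();
        try {
            for (InetAddress addr : InetAddress.getAllByName(dnsName)) {
                endpoints.add(addr.getHostAddress() + ":" + port);
            }
        } catch (UnknownHostException e) {
            throw new IllegalArgumentException("Cannot resolve " + dnsName, e);
        }
        return endpoints;
    }
}
```

A caller would pass the round-robin record (for example a hypothetical `namenodes.example.internal`) and treat each returned entry as one host, so adding or replacing a host only requires a DNS change.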
[jira] [Created] (HDFS-16164) Configuration to allow group with read-all privilege
Leon Gao created HDFS-16164:
---
Summary: Configuration to allow group with read-all privilege
Key: HDFS-16164
URL: https://issues.apache.org/jira/browse/HDFS-16164
Project: Hadoop HDFS
Issue Type: Improvement
Reporter: Leon Gao
Assignee: Leon Gao

We are seeing more use cases that need read-all permission on HDFS. One example is a data quality service that needs to read all the data but never needs to write. Currently it seems HDFS only supports a supergroup, which can do anything. Maybe we can add a configuration such as dfs.permissions.read-all.group to manage this type of permission easily.
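To make the HDFS-16164 idea concrete, here is a rough sketch of the check such a setting might drive. Everything in it is an assumption: the class, the `Action` enum and the rule that a read-all group may READ and EXECUTE (traverse directories) but not WRITE are illustrative only, and the proposed key dfs.permissions.read-all.group is not an existing HDFS option as far as this thread shows.

```java
import java.util.Set;

public class ReadAllGroupSketch {
    enum Action { READ, WRITE, EXECUTE }

    // Returns true when the normal permission check can be skipped.
    // superGroup members may do anything; members of the (hypothetical)
    // read-all group may do anything except WRITE.
    static boolean bypassesCheck(Set<String> userGroups, String superGroup,
                                 String readAllGroup, Action action) {
        if (userGroups.contains(superGroup)) {
            return true;
        }
        return readAllGroup != null
                && userGroups.contains(readAllGroup)
                && action != Action.WRITE;
    }
}
```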
[jira] [Created] (HDFS-16157) Support configuring DNS record to get list of journal nodes.
Leon Gao created HDFS-16157:
---
Summary: Support configuring DNS record to get list of journal nodes.
Key: HDFS-16157
URL: https://issues.apache.org/jira/browse/HDFS-16157
Project: Hadoop HDFS
Issue Type: Improvement
Components: journal-node
Reporter: Leon Gao
Assignee: Leon Gao

We can use a DNS round-robin record to configure the list of journal nodes, so we don't have to reconfigure every time a journal node hostname changes. For example, in some containerized environments the hostnames of journal nodes can change quite often.
[jira] [Resolved] (HDFS-15785) Datanode to support using DNS to resolve nameservices to IP addresses to get list of namenodes
[ https://issues.apache.org/jira/browse/HDFS-15785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Leon Gao resolved HDFS-15785.
-
Resolution: Fixed

> Datanode to support using DNS to resolve nameservices to IP addresses to get list of namenodes
> --
>
> Key: HDFS-15785
> URL: https://issues.apache.org/jira/browse/HDFS-15785
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: datanode
> Reporter: Leon Gao
> Assignee: Leon Gao
> Priority: Major
> Labels: pull-request-available
> Time Spent: 4h 10m
> Remaining Estimate: 0h
>
> Now that HDFS supports observers, multiple standbys and routers, namenode
> hosts change frequently in large deployments. We can consider supporting
> https://issues.apache.org/jira/browse/HDFS-14118 on the datanode to reduce
> the need to update config frequently on all datanodes. In that case,
> datanodes and clients can use the same set of config as well.
> Basically, we can resolve the DNS name and generate a namenode entry for
> each IP behind it.
[jira] [Created] (HDFS-15842) HDFS mover to emit metrics
Leon Gao created HDFS-15842:
---
Summary: HDFS mover to emit metrics
Key: HDFS-15842
URL: https://issues.apache.org/jira/browse/HDFS-15842
Project: Hadoop HDFS
Issue Type: Improvement
Components: balancer & mover
Reporter: Leon Gao
Assignee: Leon Gao

We can emit metrics through metrics2 when running the HDFS mover, which can help to monitor progress and tune mover parameters.
[jira] [Created] (HDFS-15828) Fix javac warnings from PR-2625
Leon Gao created HDFS-15828:
---
Summary: Fix javac warnings from PR-2625
Key: HDFS-15828
URL: https://issues.apache.org/jira/browse/HDFS-15828
Project: Hadoop HDFS
Issue Type: Improvement
Reporter: Leon Gao
Assignee: Leon Gao

This is a follow-up for the javac issues from HDFS-15683. Although the javac issues are not caused by the new commits, we can take the chance to fix them.
[jira] [Created] (HDFS-15818) Fix TestFsDatasetImpl.testReadLockCanBeDisabledByConfig
Leon Gao created HDFS-15818:
---
Summary: Fix TestFsDatasetImpl.testReadLockCanBeDisabledByConfig
Key: HDFS-15818
URL: https://issues.apache.org/jira/browse/HDFS-15818
Project: Hadoop HDFS
Issue Type: Sub-task
Components: test
Reporter: Leon Gao
Assignee: Leon Gao

The current TestFsDatasetImpl.testReadLockCanBeDisabledByConfig is incorrect:
1) The test fails intermittently because the holder can acquire the lock first [https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2666/1/testReport/]
2) The test passes regardless of the setting of DFS_DATANODE_LOCK_READ_WRITE_ENABLED_KEY
[jira] [Created] (HDFS-15807) RefreshVolume fails when replacing DISK/ARCHIVE vol on same mount
Leon Gao created HDFS-15807:
---
Summary: RefreshVolume fails when replacing DISK/ARCHIVE vol on same mount
Key: HDFS-15807
URL: https://issues.apache.org/jira/browse/HDFS-15807
Project: Hadoop HDFS
Issue Type: Sub-task
Components: datanode
Reporter: Leon Gao
Assignee: Leon Gao

Refreshing volumes to replace a DISK/ARCHIVE volume on the same mount fails, because we have a check for whether the same volume type already exists on the mount. We can resolve this by removing the old volumes first and then adding the new volumes in the refreshVolume logic.
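The ordering problem in HDFS-15807 can be shown with a toy model. The sketch below is not the DataNode's refreshVolume implementation; the class and its mount-to-storage-type map are hypothetical. It only demonstrates why processing removals before additions lets a same-type replacement on one mount pass the "type already exists" check.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class RefreshVolumeSketch {
    // Hypothetical model: mount -> (storage type -> volume directory).
    private final Map<String, Map<String, String>> byMount = new HashMap<>();

    void add(String mount, String type, String dir) {
        Map<String, String> types = byMount.computeIfAbsent(mount, m -> new HashMap<>());
        if (types.containsKey(type)) {
            // Mirrors the check described in the report.
            throw new IllegalStateException(type + " already exists on " + mount);
        }
        types.put(type, dir);
    }

    void remove(String mount, String type) {
        Map<String, String> types = byMount.get(mount);
        if (types != null) {
            types.remove(type);
        }
    }

    // Each entry is {mount, type} for removals and {mount, type, dir} for additions.
    // Removals run first so that a replacement does not trip the duplicate check.
    void refresh(List<String[]> toRemove, List<String[]> toAdd) {
        for (String[] v : toRemove) {
            remove(v[0], v[1]);
        }
        for (String[] v : toAdd) {
            add(v[0], v[1], v[2]);
        }
    }

    String dirOf(String mount, String type) {
        Map<String, String> types = byMount.get(mount);
        return types == null ? null : types.get(type);
    }
}
```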
[jira] [Created] (HDFS-15785) Datanode to support using DNS to resolve nameservices to IP addresses to get list of namenodes
Leon Gao created HDFS-15785:
---
Summary: Datanode to support using DNS to resolve nameservices to IP addresses to get list of namenodes
Key: HDFS-15785
URL: https://issues.apache.org/jira/browse/HDFS-15785
Project: Hadoop HDFS
Issue Type: Improvement
Components: datanode
Reporter: Leon Gao
Assignee: Leon Gao

Now that HDFS supports observers, multiple standbys and routers, namenode hosts change frequently in large deployments. We can consider supporting https://issues.apache.org/jira/browse/HDFS-14118 on the datanode to reduce the need to update config frequently. In that case, datanodes and clients can use the same set of config as well.

Basically, we can resolve the DNS name and generate a namenode entry for each IP behind it.
[jira] [Created] (HDFS-15781) Add metrics for block movements
Leon Gao created HDFS-15781:
---
Summary: Add metrics for block movements
Key: HDFS-15781
URL: https://issues.apache.org/jira/browse/HDFS-15781
Project: Hadoop HDFS
Issue Type: Sub-task
Components: datanode
Reporter: Leon Gao
Assignee: Leon Gao

We can add some metrics to track how blocks are being moved, to get a sense of the locality of movements.
* How many blocks are copied to the local host?
* How many blocks are moved to a local disk through a hardlink?
* How many blocks are copied out of the host?
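The three counters listed in HDFS-15781 could be kept roughly as follows. This is a plain-JDK sketch using `LongAdder`; it does not use Hadoop's metrics2 classes, and the class, field and method names are hypothetical.

```java
import java.util.concurrent.atomic.LongAdder;

public class BlockMoveMetricsSketch {
    final LongAdder copiedToLocalHost = new LongAdder();
    final LongAdder movedViaHardlink = new LongAdder();
    final LongAdder copiedOffHost = new LongAdder();

    // Classifies one completed block move into exactly one of the three counters.
    void record(String sourceHost, String targetHost, boolean usedHardlink) {
        if (!sourceHost.equals(targetHost)) {
            copiedOffHost.increment();      // block left this host
        } else if (usedHardlink) {
            movedViaHardlink.increment();   // same host, no data copy
        } else {
            copiedToLocalHost.increment();  // same host, data was copied
        }
    }
}
```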
[jira] [Created] (HDFS-15549) Improve DISK/ARCHIVE movement if they are on same filesystem
Leon Gao created HDFS-15549:
---
Summary: Improve DISK/ARCHIVE movement if they are on same filesystem
Key: HDFS-15549
URL: https://issues.apache.org/jira/browse/HDFS-15549
Project: Hadoop HDFS
Issue Type: Sub-task
Components: datanode
Reporter: Leon Gao
Assignee: Leon Gao

When moving blocks between DISK/ARCHIVE, we should prefer the volume on the same underlying filesystem and use "rename" instead of "copy" to save IO.
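The "rename instead of copy" idea in HDFS-15549 can be sketched with `java.nio.file`. This is an illustration under assumptions, not the DataNode's block-movement code: the class and method are hypothetical, and comparing `FileStore` objects is just one way to approximate "same underlying filesystem".

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class SameFsMoveSketch {
    // Moves src to dst, renaming when both live on the same file store and
    // falling back to copy-then-delete otherwise. Returns which path was taken.
    static String moveBlock(Path src, Path dst) {
        try {
            Files.createDirectories(dst.getParent());
            boolean sameStore =
                    Files.getFileStore(src).equals(Files.getFileStore(dst.getParent()));
            if (sameStore) {
                // A rename only updates directory entries; no block data is rewritten.
                Files.move(src, dst, StandardCopyOption.ATOMIC_MOVE);
                return "rename";
            }
            Files.copy(src, dst, StandardCopyOption.REPLACE_EXISTING);
            Files.delete(src);
            return "copy";
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```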
[jira] [Created] (HDFS-15548) Allow configuring DISK/ARCHIVE storage types on same device mount
Leon Gao created HDFS-15548:
---
Summary: Allow configuring DISK/ARCHIVE storage types on same device mount
Key: HDFS-15548
URL: https://issues.apache.org/jira/browse/HDFS-15548
Project: Hadoop HDFS
Issue Type: Sub-task
Components: datanode
Reporter: Leon Gao
Assignee: Leon Gao

We can allow configuring DISK/ARCHIVE storage types on the same device mount, in two separate directories. Users should be able to configure the capacity for each. Also, the datanode usage report should report stats correctly.
[jira] [Created] (HDFS-15547) Dynamic disk-level tiering
Leon Gao created HDFS-15547:
---
Summary: Dynamic disk-level tiering
Key: HDFS-15547
URL: https://issues.apache.org/jira/browse/HDFS-15547
Project: Hadoop HDFS
Issue Type: New Feature
Components: datanode
Reporter: Leon Gao
Assignee: Leon Gao
Attachments: Proposal - Dynamic disk-level tiering.pdf

This is a proposal for a new use case based on archival storage: allow configuring DISK and ARCHIVE storage types on the same device (filesystem) to balance disk IO across disks of different density.

The proposal mainly aims to solve two problems:

1) The disk IO of ARCHIVE disks is underutilized. This is normal in many use cases where data hotness is highly skewed.

2) Over the years, as better and cheaper hard drives appear on the market, a large production environment can end up with mixed disk densities. For example, in our prod environment we have 2TB, 4TB, 8TB, and 16TB disks. When putting all these different HDDs into the cluster, we should be able to utilize disk capacity and disk IO efficiently for all of them.

When moving blocks from DISK to ARCHIVE, we can prefer the same disk and simply rename the files instead of copying.
[jira] [Created] (HDFS-15509) Set safemode should not fail if one of the namenode is down.
Leon Gao created HDFS-15509:
---
Summary: Set safemode should not fail if one of the namenode is down.
Key: HDFS-15509
URL: https://issues.apache.org/jira/browse/HDFS-15509
Project: Hadoop HDFS
Issue Type: Improvement
Components: hdfs
Affects Versions: 3.3.0
Reporter: Leon Gao
Assignee: Leon Gao

When the first namenode (let's say nn0) is down, the set safemode command always fails unless users manually update the configuration. This is distracting when debugging issues.
[jira] [Reopened] (HDFS-14927) RBF: Add metrics for active RPC client threads
[ https://issues.apache.org/jira/browse/HDFS-14927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Leon Gao reopened HDFS-14927:
-

Reopen to gather more info

> RBF: Add metrics for active RPC client threads
> --
>
> Key: HDFS-14927
> URL: https://issues.apache.org/jira/browse/HDFS-14927
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: rbf
> Reporter: Leon Gao
> Assignee: Leon Gao
> Priority: Minor
>
> It is good to add some monitoring on the active RPC client threads, so we
> know the utilization and when to bump up
> `dfs.federation.router.client.thread-size`
[jira] [Resolved] (HDFS-14927) RBF: Add metrics for active RPC client threads
[ https://issues.apache.org/jira/browse/HDFS-14927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Leon Gao resolved HDFS-14927.
-
Resolution: Invalid

> RBF: Add metrics for active RPC client threads
> --
>
> Key: HDFS-14927
> URL: https://issues.apache.org/jira/browse/HDFS-14927
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: rbf
> Reporter: Leon Gao
> Assignee: Leon Gao
> Priority: Minor
>
> It is good to add some monitoring on the active RPC client threads, so we
> know the utilization and when to bump up
> `dfs.federation.router.client.thread-size`
[jira] [Created] (HDFS-14927) RBF: Add metrics for active RPC client threads
Leon Gao created HDFS-14927:
---
Summary: RBF: Add metrics for active RPC client threads
Key: HDFS-14927
URL: https://issues.apache.org/jira/browse/HDFS-14927
Project: Hadoop HDFS
Issue Type: Improvement
Components: rbf
Reporter: Leon Gao
Assignee: Leon Gao

It is good to add some monitoring on the active RPC client threads, so we know the utilization and when to bump up `dfs.federation.router.client.thread-size`
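As a rough picture of the gauge HDFS-14927 asks for, the sketch below reads the active thread count from a plain `java.util.concurrent.ThreadPoolExecutor` and divides it by the pool's maximum size. Whether the Router's RPC client pool exposes its executor this way is an assumption; the class and method names are hypothetical. The JDK documents `getActiveCount()` as an approximate value, which is usually acceptable for a monitoring gauge.

```java
import java.util.concurrent.ThreadPoolExecutor;

public class ActiveThreadGaugeSketch {
    // Fraction of the pool's maximum size that is currently running tasks.
    // A value that stays close to 1.0 suggests the thread-size setting is too small.
    static double utilization(ThreadPoolExecutor pool) {
        return (double) pool.getActiveCount() / pool.getMaximumPoolSize();
    }
}
```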
[jira] [Created] (HDFS-14926) RBF: Add metrics for active RPC client threads
Leon Gao created HDFS-14926:
---
Summary: RBF: Add metrics for active RPC client threads
Key: HDFS-14926
URL: https://issues.apache.org/jira/browse/HDFS-14926
Project: Hadoop HDFS
Issue Type: Improvement
Components: rbf
Reporter: Leon Gao
Assignee: Leon Gao

It is good to have some monitoring on the # of active client threads, so we know when to bump up dfs.federation.router.client.thread-size
[jira] [Created] (HDFS-14904) Balancer should pick nodes based on utilization in each iteration
Leon Gao created HDFS-14904:
---
Summary: Balancer should pick nodes based on utilization in each iteration
Key: HDFS-14904
URL: https://issues.apache.org/jira/browse/HDFS-14904
Project: Hadoop HDFS
Issue Type: Improvement
Components: balancer & mover
Reporter: Leon Gao
Assignee: Leon Gao

In each iteration, balancer should pick nodes with the highest/lowest usage first.
[jira] [Created] (HDFS-14894) Add balancer parameter to balance top N used nodes
Leon Gao created HDFS-14894:
---
Summary: Add balancer parameter to balance top N used nodes
Key: HDFS-14894
URL: https://issues.apache.org/jira/browse/HDFS-14894
Project: Hadoop HDFS
Issue Type: Improvement
Components: balancer & mover
Reporter: Leon Gao
Assignee: Leon Gao

We sometimes see a few of our datanodes reach very high usage (for various reasons), and we need to reduce their usage urgently.

We currently see two ways to achieve this:
- Calculate and reset the balancing threshold.
- Pick nodes manually according to usage stats, put them in a file and use the `-source` flag.

However, neither is very intuitive, and both involve too much manual work in an urgent, close-to-outage situation.

Adding a small feature that automatically picks the top N used hosts would be a straightforward option, for example `-top 10` to target only the 10 most-used datanodes.
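The selection step behind the proposed `-top N` option (HDFS-14894), and the "highest usage first" ordering from HDFS-14904, amount to sorting hosts by utilization and taking a prefix. A minimal sketch follows; the class name and the shape of the input map (host name to used fraction) are assumptions, and it is not the Balancer's implementation.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class TopNUsedNodesSketch {
    // Returns up to n host names, ordered from most used to least used.
    static List<String> topN(Map<String, Double> utilizationByHost, int n) {
        return utilizationByHost.entrySet().stream()
                .sorted(Map.Entry.<String, Double>comparingByValue().reversed())
                .limit(Math.max(0, n))
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }
}
```

The returned list could then be used the same way as a hand-written host file, which is the manual step the proposal wants to avoid.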