[jira] [Created] (HDFS-16268) Balancer stuck when moving striped blocks due to NPE

2021-10-11 Thread Leon Gao (Jira)
Leon Gao created HDFS-16268:
---

 Summary: Balancer stuck when moving striped blocks due to NPE
 Key: HDFS-16268
 URL: https://issues.apache.org/jira/browse/HDFS-16268
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: balancer & mover, erasure-coding
Affects Versions: 3.2.2
Reporter: Leon Gao
Assignee: Leon Gao


{code:java}
21/10/11 06:11:26 WARN balancer.Dispatcher: Dispatcher thread failed
java.lang.NullPointerException
        at org.apache.hadoop.hdfs.server.balancer.Dispatcher$PendingMove.markMovedIfGoodBlock(Dispatcher.java:289)
        at org.apache.hadoop.hdfs.server.balancer.Dispatcher$PendingMove.chooseBlockAndProxy(Dispatcher.java:272)
        at org.apache.hadoop.hdfs.server.balancer.Dispatcher$PendingMove.access$2500(Dispatcher.java:236)
        at org.apache.hadoop.hdfs.server.balancer.Dispatcher$Source.chooseNextMove(Dispatcher.java:899)
        at org.apache.hadoop.hdfs.server.balancer.Dispatcher$Source.dispatchBlocks(Dispatcher.java:958)
        at org.apache.hadoop.hdfs.server.balancer.Dispatcher$Source.access$3300(Dispatcher.java:757)
        at org.apache.hadoop.hdfs.server.balancer.Dispatcher$2.run(Dispatcher.java:1226)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
{code}
Because the NPE is thrown partway through dispatching, pending moves are left 
in the queue, so the balancer is stuck forever.
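
The failure mode described above (one exception leaving queued work behind) can be illustrated with a minimal sketch. This is not the real Dispatcher code and not necessarily how the issue was fixed; the class name PendingMoveDrainer and its drain method are hypothetical, and the queue of pending moves is modelled as a plain java.util.Queue.

```java
import java.util.Queue;
import java.util.function.Consumer;

/** Hypothetical sketch; this is not the real Dispatcher code. */
final class PendingMoveDrainer {

  /**
   * Runs every queued move. A move that throws is counted as failed and
   * dropped, so one bad block cannot strand the moves queued behind it.
   *
   * @return the number of moves that failed
   */
  static <T> int drain(Queue<T> pending, Consumer<T> mover) {
    int failed = 0;
    T move;
    while ((move = pending.poll()) != null) {
      try {
        mover.accept(move);
      } catch (RuntimeException e) {
        // Record the failure and keep going instead of killing the thread.
        failed++;
      }
    }
    return failed;
  }
}
```

The point of the sketch is only that the queue ends up empty whether or not an individual move throws; a real fix would also need to address the null value that triggers the NPE.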



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Created] (HDFS-16224) testBalancerWithObserverWithFailedNode times out

2021-09-12 Thread Leon Gao (Jira)
Leon Gao created HDFS-16224:
---

 Summary: testBalancerWithObserverWithFailedNode times out
 Key: HDFS-16224
 URL: https://issues.apache.org/jira/browse/HDFS-16224
 Project: Hadoop HDFS
  Issue Type: Test
  Components: test
Reporter: Leon Gao
Assignee: Leon Gao


testBalancerWithObserverWithFailedNode fails intermittently.

This seems to happen because the datanodes cannot shut down until they have 
finished their retries to the failed observer.






[jira] [Created] (HDFS-16188) Router to support resolving monitored namenodes with DNS

2021-08-26 Thread Leon Gao (Jira)
Leon Gao created HDFS-16188:
---

 Summary: Router to support resolving monitored namenodes with DNS
 Key: HDFS-16188
 URL: https://issues.apache.org/jira/browse/HDFS-16188
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: rbf
 Environment: We can use a DNS round-robin record to configure the list of 
monitored namenodes, so we don't have to reconfigure every time a namenode 
hostname changes. For example, in a containerized environment the hostnames of 
namenodes and observers can change quite often.
Reporter: Leon Gao
Assignee: Leon Gao









[jira] [Created] (HDFS-16164) Configuration to allow group with read-all privilege

2021-08-11 Thread Leon Gao (Jira)
Leon Gao created HDFS-16164:
---

 Summary: Configuration to allow group with read-all privilege
 Key: HDFS-16164
 URL: https://issues.apache.org/jira/browse/HDFS-16164
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: Leon Gao
Assignee: Leon Gao


We see more use cases that need read-all permission on HDFS. One example is a 
data quality service that needs to read all the data but never needs to write. 
Currently HDFS seems to support only the supergroup, which can do anything.

Maybe we can add a configuration key such as dfs.permissions.read-all.group to 
manage this type of permission easily.
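
As a rough illustration of the idea, the check could look something like the sketch below. Everything here is hypothetical: ReadAllGroupCheck, its Action enum and the method name are made up for this example and are not existing HDFS code. The sketch also assumes that "read-all" covers directory traversal (EXECUTE) as well as READ, which the proposal does not spell out.

```java
import java.util.Set;

/** Hypothetical sketch of a read-all group check; not existing HDFS code. */
final class ReadAllGroupCheck {

  /** Simplified stand-in for the kinds of access a caller may request. */
  enum Action { READ, EXECUTE, WRITE }

  /**
   * Returns true if the caller may skip the normal permission check.
   *
   * @param userGroups   groups the caller belongs to
   * @param superGroup   the configured supergroup (must be non-null)
   * @param readAllGroup the proposed read-all group, or null if unset
   */
  static boolean bypassesPermissionCheck(
      Set<String> userGroups, String superGroup, String readAllGroup, Action action) {
    if (userGroups.contains(superGroup)) {
      return true; // supergroup members can do anything
    }
    // Assumption: read-all members may read and traverse, but never write.
    return action != Action.WRITE
        && readAllGroup != null
        && userGroups.contains(readAllGroup);
  }
}
```

With this shape, leaving the read-all group unset keeps the current behaviour, where only the supergroup bypasses permission checks.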

 






[jira] [Created] (HDFS-16157) Support configuring DNS record to get list of journal nodes.

2021-08-09 Thread Leon Gao (Jira)
Leon Gao created HDFS-16157:
---

 Summary: Support configuring DNS record to get list of journal 
nodes.
 Key: HDFS-16157
 URL: https://issues.apache.org/jira/browse/HDFS-16157
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: journal-node
Reporter: Leon Gao
Assignee: Leon Gao


We can use a DNS round-robin record to configure the list of journal nodes, so 
we don't have to reconfigure every time a journal node hostname changes. For 
example, in some containerized environments the hostnames of journal nodes can 
change quite often.






[jira] [Resolved] (HDFS-15785) Datanode to support using DNS to resolve nameservices to IP addresses to get list of namenodes

2021-07-13 Thread Leon Gao (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Leon Gao resolved HDFS-15785.
-
Resolution: Fixed

> Datanode to support using DNS to resolve nameservices to IP addresses to get 
> list of namenodes
> --
>
> Key: HDFS-15785
> URL: https://issues.apache.org/jira/browse/HDFS-15785
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
> Reporter: Leon Gao
> Assignee: Leon Gao
> Priority: Major
>  Labels: pull-request-available
>  Time Spent: 4h 10m
>  Remaining Estimate: 0h
>
> Now that HDFS supports observers, multiple standbys and routers, the 
> namenode hosts change frequently in large deployments. We can consider 
> supporting https://issues.apache.org/jira/browse/HDFS-14118 on datanodes to 
> reduce the need to update config frequently on all datanodes. In that case, 
> datanodes and clients can use the same set of config as well.
> Basically, we can resolve the DNS name and generate a namenode for each IP 
> behind it.






[jira] [Created] (HDFS-15842) HDFS mover to emit metrics

2021-02-19 Thread Leon Gao (Jira)
Leon Gao created HDFS-15842:
---

 Summary: HDFS mover to emit metrics
 Key: HDFS-15842
 URL: https://issues.apache.org/jira/browse/HDFS-15842
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: balancer & mover
Reporter: Leon Gao
Assignee: Leon Gao


We can emit metrics through metrics2 when running the HDFS mover, which helps 
to monitor progress and tune mover parameters.






[jira] [Created] (HDFS-15828) Fix javac warnings from PR-2625

2021-02-08 Thread Leon Gao (Jira)
Leon Gao created HDFS-15828:
---

 Summary: Fix javac warnings from PR-2625
 Key: HDFS-15828
 URL: https://issues.apache.org/jira/browse/HDFS-15828
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: Leon Gao
Assignee: Leon Gao


This follows up on the javac issues from HDFS-15683.

Although the javac issues are not caused by the new commits, we can take the 
chance to fix them.






[jira] [Created] (HDFS-15818) Fix TestFsDatasetImpl.testReadLockCanBeDisabledByConfig

2021-02-03 Thread Leon Gao (Jira)
Leon Gao created HDFS-15818:
---

 Summary: Fix TestFsDatasetImpl.testReadLockCanBeDisabledByConfig
 Key: HDFS-15818
 URL: https://issues.apache.org/jira/browse/HDFS-15818
 Project: Hadoop HDFS
  Issue Type: Sub-task
  Components: test
Reporter: Leon Gao
Assignee: Leon Gao


The current TestFsDatasetImpl.testReadLockCanBeDisabledByConfig is incorrect:

1) The test fails intermittently because the holder can acquire the lock first:

[https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2666/1/testReport/]

2) The test passes regardless of the setting of 
DFS_DATANODE_LOCK_READ_WRITE_ENABLED_KEY.






[jira] [Created] (HDFS-15807) RefreshVolume fails when replacing DISK/ARCHIVE vol on same mount

2021-01-31 Thread Leon Gao (Jira)
Leon Gao created HDFS-15807:
---

 Summary: RefreshVolume fails when replacing DISK/ARCHIVE vol on 
same mount
 Key: HDFS-15807
 URL: https://issues.apache.org/jira/browse/HDFS-15807
 Project: Hadoop HDFS
  Issue Type: Sub-task
  Components: datanode
Reporter: Leon Gao
Assignee: Leon Gao


When refreshing volumes to replace DISK/ARCHIVE on the same mount, the refresh 
fails because we check whether the same volume type already exists on the mount.

We can resolve this by removing the old volumes first and then adding the new 
volumes in the refreshVolume logic.






[jira] [Created] (HDFS-15785) Datanode to support using DNS to resolve nameservices to IP addresses to get list of namenodes

2021-01-21 Thread Leon Gao (Jira)
Leon Gao created HDFS-15785:
---

 Summary: Datanode to support using DNS to resolve nameservices to 
IP addresses to get list of namenodes
 Key: HDFS-15785
 URL: https://issues.apache.org/jira/browse/HDFS-15785
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: datanode
Reporter: Leon Gao
Assignee: Leon Gao


Now that HDFS supports observers, multiple standbys and routers, the namenode 
hosts change frequently in large deployments. We can consider supporting 
https://issues.apache.org/jira/browse/HDFS-14118 on datanodes to reduce the need 
to update config frequently. In that case, datanodes and clients can use the 
same set of config as well.

Basically, we can resolve the DNS name and generate a namenode for each IP 
behind it.
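
A minimal sketch of the "one namenode per IP behind the DNS name" step, assuming plain JDK name resolution. The class DnsExpansion, its Resolver interface and the host name and port used in the usage note are hypothetical and only illustrate the idea; this is not the code from HDFS-14118 or from the datanode.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch; not the actual datanode implementation. */
final class DnsExpansion {

  /** Minimal resolver abstraction so the lookup can be stubbed in tests. */
  interface Resolver {
    InetAddress[] resolve(String host) throws UnknownHostException;
  }

  /**
   * Expands one DNS name into "ip:port" entries, one per address record.
   * IPv6 addresses would need brackets; this sketch only handles IPv4.
   */
  static List<String> expand(String host, int port, Resolver resolver)
      throws UnknownHostException {
    List<String> out = new ArrayList<>();
    for (InetAddress addr : resolver.resolve(host)) {
      out.add(addr.getHostAddress() + ":" + port);
    }
    return out;
  }

  /** Uses the JVM's resolver; results depend on the local DNS setup. */
  static List<String> expand(String host, int port) throws UnknownHostException {
    return expand(host, port, InetAddress::getAllByName);
  }
}
```

For example, DnsExpansion.expand("nn.example.com", 8020) would return one entry per A record behind that (made-up) name, and each entry could then be treated as a separate namenode address.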






[jira] [Created] (HDFS-15781) Add metrics for block movements

2021-01-18 Thread Leon Gao (Jira)
Leon Gao created HDFS-15781:
---

 Summary: Add metrics for block movements
 Key: HDFS-15781
 URL: https://issues.apache.org/jira/browse/HDFS-15781
 Project: Hadoop HDFS
  Issue Type: Sub-task
  Components: datanode
Reporter: Leon Gao
Assignee: Leon Gao


We can add some metrics to track how blocks are being moved, to get a sense of 
the locality of movements.
 * How many blocks are copied to the local host?
 * How many blocks are moved to a local disk through hardlink?
 * How many blocks are copied out of the host?
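
The three questions above map naturally onto three counters. The sketch below uses plain JDK LongAdder counters; a real datanode would presumably publish them through metrics2 instead. The class BlockMoveCounters and the Kind names are hypothetical.

```java
import java.util.EnumMap;
import java.util.Map;
import java.util.concurrent.atomic.LongAdder;

/** Hypothetical counters; a real datanode would publish these via metrics2. */
final class BlockMoveCounters {

  /** One counter per question in the list above. */
  enum Kind { COPIED_TO_LOCAL_HOST, HARDLINKED_ON_LOCAL_DISK, COPIED_OFF_HOST }

  // The map is filled once in the constructor and never modified afterwards,
  // so concurrent record() calls only touch the thread-safe LongAdders.
  private final Map<Kind, LongAdder> counts = new EnumMap<>(Kind.class);

  BlockMoveCounters() {
    for (Kind k : Kind.values()) {
      counts.put(k, new LongAdder());
    }
  }

  void record(Kind kind) {
    counts.get(kind).increment();
  }

  long get(Kind kind) {
    return counts.get(kind).sum();
  }

  /** Fraction of recorded moves that stayed on the host; 0.0 if none recorded. */
  double localFraction() {
    long local = get(Kind.COPIED_TO_LOCAL_HOST) + get(Kind.HARDLINKED_ON_LOCAL_DISK);
    long total = local + get(Kind.COPIED_OFF_HOST);
    return total == 0 ? 0.0 : (double) local / total;
  }
}
```

The localFraction value is one possible way to express the "locality of movements" as a single number.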

 






[jira] [Created] (HDFS-15549) Improve DISK/ARCHIVE movement if they are on same filesystem

2020-08-28 Thread Leon Gao (Jira)
Leon Gao created HDFS-15549:
---

 Summary: Improve DISK/ARCHIVE movement if they are on same 
filesystem
 Key: HDFS-15549
 URL: https://issues.apache.org/jira/browse/HDFS-15549
 Project: Hadoop HDFS
  Issue Type: Sub-task
  Components: datanode
Reporter: Leon Gao
Assignee: Leon Gao


When moving blocks between DISK/ARCHIVE, we should prefer the volume on the 
same underlying filesystem and use "rename" instead of "copy" to save IO.
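
A minimal sketch of the rename-versus-copy decision, using only java.nio.file. It assumes that comparing FileStore objects is a good enough test for "same underlying filesystem"; the class SameFilesystemMove and the method moveBlock are hypothetical and do not reflect the datanode's actual volume logic.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Hypothetical sketch; not the datanode's actual block-move code. */
final class SameFilesystemMove {

  /**
   * Moves a block file into dstDir. If source and target live on the same
   * file store, a rename is used and no data is rewritten; otherwise the
   * file is copied and the source deleted.
   *
   * @return true if the cheap rename path was taken
   */
  static boolean moveBlock(Path src, Path dstDir) throws IOException {
    Path dst = dstDir.resolve(src.getFileName());
    if (Files.getFileStore(src).equals(Files.getFileStore(dstDir))) {
      Files.move(src, dst, StandardCopyOption.ATOMIC_MOVE);
      return true;
    }
    Files.copy(src, dst, StandardCopyOption.REPLACE_EXISTING);
    Files.delete(src);
    return false;
  }
}
```

The rename path only updates directory entries, which is where the IO saving comes from; the copy path still reads and rewrites every byte of the block.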






[jira] [Created] (HDFS-15548) Allow configuring DISK/ARCHIVE storage types on same device mount

2020-08-28 Thread Leon Gao (Jira)
Leon Gao created HDFS-15548:
---

 Summary: Allow configuring DISK/ARCHIVE storage types on same 
device mount
 Key: HDFS-15548
 URL: https://issues.apache.org/jira/browse/HDFS-15548
 Project: Hadoop HDFS
  Issue Type: Sub-task
  Components: datanode
Reporter: Leon Gao
Assignee: Leon Gao


We can allow configuring DISK/ARCHIVE storage types on the same device mount on 
two separate directories.

Users should be able to configure the capacity for each. Also, the datanode 
usage report should report stats correctly.






[jira] [Created] (HDFS-15547) Dynamic disk-level tiering

2020-08-28 Thread Leon Gao (Jira)
Leon Gao created HDFS-15547:
---

 Summary: Dynamic disk-level tiering
 Key: HDFS-15547
 URL: https://issues.apache.org/jira/browse/HDFS-15547
 Project: Hadoop HDFS
  Issue Type: New Feature
  Components: datanode
Reporter: Leon Gao
Assignee: Leon Gao
 Attachments: Proposal - Dynamic disk-level tiering.pdf

This is a proposal for a new use case based on archival storage, to allow 
configuring DISK and ARCHIVE storage types on the same device (filesystem) to 
balance disk IO for disks with different density.

The proposal mainly aims to solve two problems:

1) The disk IO of ARCHIVE disks is underutilized. This is normal in many use 
cases where the data hotness is highly skewed.

2) Over the years, as better and cheaper hard drives appear on the market, a 
large production environment can end up with mixed disk densities. For example, in our prod 
environment, we have 2TB, 4TB, 8TB, and 16TB disks. When putting all different 
HDDs into the cluster, we should be able to utilize disk capacity and disk IO 
efficiently for all of them.

When moving blocks from DISK to ARCHIVE, we can prefer the same disk and simply 
rename the files instead of copying.






[jira] [Created] (HDFS-15509) Set safemode should not fail if one of the namenode is down.

2020-08-02 Thread Leon Gao (Jira)
Leon Gao created HDFS-15509:
---

 Summary: Set safemode should not fail if one of the namenode is 
down.
 Key: HDFS-15509
 URL: https://issues.apache.org/jira/browse/HDFS-15509
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: hdfs
Affects Versions: 3.3.0
Reporter: Leon Gao
Assignee: Leon Gao


When the first namenode (let's say nn0) is down, the set safemode command 
always fails unless users manually update the configuration. This is distracting 
when debugging issues.






[jira] [Reopened] (HDFS-14927) RBF: Add metrics for active RPC client threads

2019-10-23 Thread Leon Gao (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Leon Gao reopened HDFS-14927:
-

Reopen to gather more info

> RBF: Add metrics for active RPC client threads
> --
>
> Key: HDFS-14927
> URL: https://issues.apache.org/jira/browse/HDFS-14927
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: rbf
> Reporter: Leon Gao
> Assignee: Leon Gao
> Priority: Minor
>
> It is good to add some monitoring on the active RPC client threads, so we 
> know the utilization and when to bump up 
> `dfs.federation.router.client.thread-size`






[jira] [Resolved] (HDFS-14927) RBF: Add metrics for active RPC client threads

2019-10-23 Thread Leon Gao (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Leon Gao resolved HDFS-14927.
-
Resolution: Invalid

> RBF: Add metrics for active RPC client threads
> --
>
> Key: HDFS-14927
> URL: https://issues.apache.org/jira/browse/HDFS-14927
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: rbf
> Reporter: Leon Gao
> Assignee: Leon Gao
> Priority: Minor
>
> It is good to add some monitoring on the active RPC client threads, so we 
> know the utilization and when to bump up 
> `dfs.federation.router.client.thread-size`






[jira] [Created] (HDFS-14927) RBF: Add metrics for active RPC client threads

2019-10-23 Thread Leon Gao (Jira)
Leon Gao created HDFS-14927:
---

 Summary: RBF: Add metrics for active RPC client threads
 Key: HDFS-14927
 URL: https://issues.apache.org/jira/browse/HDFS-14927
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: rbf
Reporter: Leon Gao
Assignee: Leon Gao


It is good to add some monitoring on the active RPC client threads, so we know 
the utilization and when to bump up `dfs.federation.router.client.thread-size`
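
As a sketch of what such a metric could report, assuming the client threads run in a java.util.concurrent.ThreadPoolExecutor (an assumption about the router made only for this example). The class ClientPoolGauge is hypothetical; getActiveCount and getMaximumPoolSize are standard JDK methods.

```java
import java.util.concurrent.ThreadPoolExecutor;

/** Hypothetical gauge over a thread pool; not actual router code. */
final class ClientPoolGauge {

  /**
   * Returns the fraction of the pool's maximum size that is currently busy.
   * getActiveCount() is documented as approximate, so treat the value as a
   * monitoring signal rather than an exact measurement.
   */
  static double utilization(ThreadPoolExecutor pool) {
    return (double) pool.getActiveCount() / pool.getMaximumPoolSize();
  }
}
```

A value that stays close to 1.0 would be the signal to raise the thread-size setting mentioned above.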






[jira] [Created] (HDFS-14926) RBF: Add metrics for active RPC client threads

2019-10-23 Thread Leon Gao (Jira)
Leon Gao created HDFS-14926:
---

 Summary: RBF: Add metrics for active RPC client threads
 Key: HDFS-14926
 URL: https://issues.apache.org/jira/browse/HDFS-14926
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: rbf
Reporter: Leon Gao
Assignee: Leon Gao


It is good to have some monitoring on the # of active client threads, so we 
know when to bump up dfs.federation.router.client.thread-size






[jira] [Created] (HDFS-14904) Balancer should pick nodes based on utilization in each iteration

2019-10-11 Thread Leon Gao (Jira)
Leon Gao created HDFS-14904:
---

 Summary: Balancer should pick nodes based on utilization in each 
iteration
 Key: HDFS-14904
 URL: https://issues.apache.org/jira/browse/HDFS-14904
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: balancer & mover
Reporter: Leon Gao
Assignee: Leon Gao


In each iteration, balancer should pick nodes with the highest/lowest usage 
first.






[jira] [Created] (HDFS-14894) Add balancer parameter to balance top N used nodes

2019-10-05 Thread Leon Gao (Jira)
Leon Gao created HDFS-14894:
---

 Summary: Add balancer parameter to balance top N used nodes
 Key: HDFS-14894
 URL: https://issues.apache.org/jira/browse/HDFS-14894
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: balancer & mover
Reporter: Leon Gao
Assignee: Leon Gao


We sometimes see a few of our datanodes reach very high usage (for various 
reasons) and need to reduce their usage urgently.

We currently see two ways to achieve this:

- Calculate and reset the balancing threshold.

- Pick nodes manually according to usage stats, put them in a file and use the 
`-source` flag.

However, neither is very intuitive, and both require too much manual work in an 
urgent close-to-outage situation. A small feature that automatically picks the 
top N used hosts would be a straightforward option, for example `-top 10` to 
target only the 10 most-used datanodes.
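
The selection step itself is small. A minimal sketch, assuming the balancer already has a per-host utilization percentage available as a map; the class TopUsedNodes and the method topN are hypothetical and are not part of the balancer.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

/** Hypothetical sketch of picking the N most-used datanodes. */
final class TopUsedNodes {

  /**
   * Returns up to n host names, ordered from most used to least used.
   * Ties are broken by host name so the result is deterministic.
   */
  static List<String> topN(Map<String, Double> utilizationByHost, int n) {
    return utilizationByHost.entrySet().stream()
        .sorted(Map.Entry.<String, Double>comparingByValue().reversed()
            .thenComparing(Map.Entry.<String, Double>comparingByKey()))
        .limit(Math.max(0, n))
        .map(Map.Entry::getKey)
        .collect(Collectors.toList());
  }
}
```

The resulting list could then be used as the set of source nodes for a balancer run, in place of a manually prepared hosts file.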


