[jira] [Commented] (HDFS-13157) Do Not Remove Blocks Sequentially During Decommission

2020-07-31 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-13157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17169220#comment-17169220
 ] 

Hadoop QA commented on HDFS-13157:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
45s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} dupname {color} | {color:green}  0m  
0s{color} | {color:green} No case conflicting files found. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 1 new or modified test 
files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 34m 
41s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m 
17s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
57s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m 
29s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
20m 10s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
49s{color} | {color:green} trunk passed {color} |
| {color:blue}0{color} | {color:blue} spotbugs {color} | {color:blue}  3m 
41s{color} | {color:blue} Used deprecated FindBugs config; considering 
switching to SpotBugs. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  3m 
39s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:red}-1{color} | {color:red} mvninstall {color} | {color:red}  0m 
41s{color} | {color:red} hadoop-hdfs in the patch failed. {color} |
| {color:red}-1{color} | {color:red} compile {color} | {color:red}  0m 
38s{color} | {color:red} hadoop-hdfs in the patch failed. {color} |
| {color:red}-1{color} | {color:red} javac {color} | {color:red}  0m 38s{color} 
| {color:red} hadoop-hdfs in the patch failed. {color} |
| {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange}  
0m 40s{color} | {color:orange} hadoop-hdfs-project/hadoop-hdfs: The patch 
generated 2 new + 161 unchanged - 1 fixed = 163 total (was 162) {color} |
| {color:red}-1{color} | {color:red} mvnsite {color} | {color:red}  0m 
37s{color} | {color:red} hadoop-hdfs in the patch failed. {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:red}-1{color} | {color:red} shadedclient {color} | {color:red}  4m 
18s{color} | {color:red} patch has errors when building and testing our client 
artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
42s{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} findbugs {color} | {color:red}  0m 
38s{color} | {color:red} hadoop-hdfs in the patch failed. {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:red}-1{color} | {color:red} unit {color} | {color:red}  0m 40s{color} 
| {color:red} hadoop-hdfs in the patch failed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
29s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 72m 56s{color} | 
{color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | ClientAPI=1.40 ServerAPI=1.40 base: 
https://ci-hadoop.apache.org/job/PreCommit-HDFS-Build/21/artifact/out/Dockerfile
 |
| JIRA Issue | HDFS-13157 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12979084/HDFS-13157.1.patch |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite 
unit shadedclient findbugs checkstyle |
| uname | Linux 7785f115c6a5 4.15.0-58-generic #64-Ubuntu SMP Tue Aug 6 
11:12:41 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | personality/hadoop.sh |
| git revision | trunk / a7fda2e38f2 |
| Default Java | Private Build-1.8.0_252-8u252-b09-1~18.04-b09 |
| mvninstall | 
https://ci-hadoop.apache.org/job/PreCommit-HDFS-Build/21/artifact/out/patch-mvninstall-hadoop-hdfs-project_hadoop-hdfs.txt
 |
| compile | 

[jira] [Commented] (HDFS-13157) Do Not Remove Blocks Sequentially During Decommission

2020-07-31 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-13157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17169154#comment-17169154
 ] 

Hadoop QA commented on HDFS-13157:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m  
0s{color} | {color:blue} Docker mode activated. {color} |
| {color:red}-1{color} | {color:red} patch {color} | {color:red}  0m  5s{color} 
| {color:red} https://github.com/apache/hadoop/pull/1391 does not apply to 
trunk. Rebase required? Wrong Branch? See 
https://wiki.apache.org/hadoop/HowToContribute for help. {color} |
\\
\\
|| Subsystem || Report/Notes ||
| GITHUB PR | https://github.com/apache/hadoop/pull/1391 |
| JIRA Issue | HDFS-13157 |
| Console output | 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-1391/4/console |
| versions | git=2.17.1 |
| Powered by | Apache Yetus 0.12.0 https://yetus.apache.org |


This message was automatically generated.



> Do Not Remove Blocks Sequentially During Decommission 
> --
>
> Key: HDFS-13157
> URL: https://issues.apache.org/jira/browse/HDFS-13157
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode, namenode
>Affects Versions: 3.0.0
>Reporter: David Mollitor
>Assignee: David Mollitor
>Priority: Major
> Attachments: HDFS-13157.1.patch
>
>
> From what I understand of [DataNode 
> decommissioning|https://github.com/apache/hadoop/blob/42a1c98597e6dba2e371510a6b2b6b1fb94e4090/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeAdminManager.java]
>  it appears that all the blocks are scheduled for removal _in order_. I'm 
> not 100% sure what the ordering is exactly, but I think it loops through each 
> data volume and schedules each block to be replicated elsewhere. The net 
> effect is that during a decommission, all of the DataNode transfer threads 
> slam on a single volume until it is cleaned out, at which point they all 
> slam on the next volume, etc.
> Please randomize the block list so that there is a more even distribution 
> across all volumes when decommissioning a node.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-13157) Do Not Remove Blocks Sequentially During Decommission

2020-07-31 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-13157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17168982#comment-17168982
 ] 

Hadoop QA commented on HDFS-13157:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m  
0s{color} | {color:blue} Docker mode activated. {color} |
| {color:red}-1{color} | {color:red} patch {color} | {color:red}  0m  5s{color} 
| {color:red} https://github.com/apache/hadoop/pull/1391 does not apply to 
trunk. Rebase required? Wrong Branch? See 
https://wiki.apache.org/hadoop/HowToContribute for help. {color} |
\\
\\
|| Subsystem || Report/Notes ||
| GITHUB PR | https://github.com/apache/hadoop/pull/1391 |
| JIRA Issue | HDFS-13157 |
| Console output | 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-1391/3/console |
| versions | git=2.17.1 |
| Powered by | Apache Yetus 0.12.0 https://yetus.apache.org |


This message was automatically generated.



> Do Not Remove Blocks Sequentially During Decommission 
> --
>
> Key: HDFS-13157
> URL: https://issues.apache.org/jira/browse/HDFS-13157
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode, namenode
>Affects Versions: 3.0.0
>Reporter: David Mollitor
>Assignee: David Mollitor
>Priority: Major
> Attachments: HDFS-13157.1.patch
>
>
> From what I understand of [DataNode 
> decommissioning|https://github.com/apache/hadoop/blob/42a1c98597e6dba2e371510a6b2b6b1fb94e4090/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeAdminManager.java]
>  it appears that all the blocks are scheduled for removal _in order_. I'm 
> not 100% sure what the ordering is exactly, but I think it loops through each 
> data volume and schedules each block to be replicated elsewhere. The net 
> effect is that during a decommission, all of the DataNode transfer threads 
> slam on a single volume until it is cleaned out, at which point they all 
> slam on the next volume, etc.
> Please randomize the block list so that there is a more even distribution 
> across all volumes when decommissioning a node.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-13157) Do Not Remove Blocks Sequentially During Decommission

2019-09-18 Thread Stephen O'Donnell (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-13157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16932488#comment-16932488
 ] 

Stephen O'Donnell commented on HDFS-13157:
--

As we uncovered quite a few issues in this Jira, I have opened HDFS-14854 to 
propose a new implementation for the decommission monitor thread. I have 
posted a design doc there and should be able to post a patch that implements 
the ideas later today or tomorrow.

> Do Not Remove Blocks Sequentially During Decommission 
> --
>
> Key: HDFS-13157
> URL: https://issues.apache.org/jira/browse/HDFS-13157
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode, namenode
>Affects Versions: 3.0.0
>Reporter: David Mollitor
>Assignee: David Mollitor
>Priority: Major
> Attachments: HDFS-13157.1.patch
>
>
> From what I understand of [DataNode 
> decommissioning|https://github.com/apache/hadoop/blob/42a1c98597e6dba2e371510a6b2b6b1fb94e4090/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeAdminManager.java]
>  it appears that all the blocks are scheduled for removal _in order_. I'm 
> not 100% sure what the ordering is exactly, but I think it loops through each 
> data volume and schedules each block to be replicated elsewhere. The net 
> effect is that during a decommission, all of the DataNode transfer threads 
> slam on a single volume until it is cleaned out, at which point they all 
> slam on the next volume, etc.
> Please randomize the block list so that there is a more even distribution 
> across all volumes when decommissioning a node.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-13157) Do Not Remove Blocks Sequentially During Decommission

2019-09-12 Thread David Mollitor (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-13157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16928827#comment-16928827
 ] 

David Mollitor commented on HDFS-13157:
---

[~sodonnell]

Doing a 'random queue' is very tricky.  Since selection is random, it is always 
mathematically possible that some item sits in the queue forever; there is no 
guarantee that it will ever be selected.

I am thinking something like this as an alternative:

# Mark each node as decommissioned
# Grab the lock
# Create small batches of blocks... rolling through the list of DataNodes, 
rolling through the list of volumes (as proposed in this patch)
# Wrap each item in the batch into a `Future` and submit them into the queue
# Release the lock
# Wait for every `Future` in the batch to complete (with a timeout)
# Repeat until done

This would require that the Replication Queue accept a `Future`, which is 
probably not a bad thing anyway.
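
A rough, self-contained sketch of what that batch-and-wait loop could look like, 
assuming the steps above; the lock, the block type and the "replication queue" 
are stand-ins (a `ReentrantLock`, `String` block IDs and an `ExecutorService`), 
not the real DatanodeAdminManager or replication-queue APIs:

{code:java}
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.locks.ReentrantLock;

// Illustrative sketch only: the lock, block type and "replication queue" are
// placeholders, not the actual NameNode lock or replication-queue classes.
public class BatchedDecommissionSketch {
  private final ReentrantLock nnLock = new ReentrantLock();
  private final ExecutorService replicationQueue = Executors.newFixedThreadPool(4);

  void decommission(List<String> allBlocks, int batchSize) throws InterruptedException {
    for (int start = 0; start < allBlocks.size(); start += batchSize) {
      List<Future<?>> batch = new ArrayList<>();
      nnLock.lock();
      try {
        // Build one small batch under the lock and submit each block as a Future.
        int end = Math.min(start + batchSize, allBlocks.size());
        for (String block : allBlocks.subList(start, end)) {
          batch.add(replicationQueue.submit(() -> replicate(block)));
        }
      } finally {
        nnLock.unlock(); // release the lock before waiting on the batch
      }
      for (Future<?> f : batch) {
        try {
          f.get(10, TimeUnit.MINUTES); // wait for the whole batch, with a timeout
        } catch (ExecutionException | TimeoutException e) {
          // a failed or timed-out block would be retried in a later pass
        }
      }
    }
  }

  private void replicate(String block) {
    // stand-in for scheduling the block to be copied elsewhere
  }
}
{code}

The key point is simply that the lock is only held while building a batch, not 
while waiting for it to drain.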


> Do Not Remove Blocks Sequentially During Decommission 
> --
>
> Key: HDFS-13157
> URL: https://issues.apache.org/jira/browse/HDFS-13157
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode, namenode
>Affects Versions: 3.0.0
>Reporter: David Mollitor
>Assignee: David Mollitor
>Priority: Major
> Attachments: HDFS-13157.1.patch
>
>
> From what I understand of [DataNode 
> decommissioning|https://github.com/apache/hadoop/blob/42a1c98597e6dba2e371510a6b2b6b1fb94e4090/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeAdminManager.java]
>  it appears that all the blocks are scheduled for removal _in order_. I'm 
> not 100% sure what the ordering is exactly, but I think it loops through each 
> data volume and schedules each block to be replicated elsewhere. The net 
> effect is that during a decommission, all of the DataNode transfer threads 
> slam on a single volume until it is cleaned out, at which point they all 
> slam on the next volume, etc.
> Please randomize the block list so that there is a more even distribution 
> across all volumes when decommissioning a node.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-13157) Do Not Remove Blocks Sequentially During Decommission

2019-09-12 Thread Stephen O'Donnell (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-13157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16928431#comment-16928431
 ] 

Stephen O'Donnell commented on HDFS-13157:
--

I've been thinking about this issue on and off over the last few days and I 
realised that most of the problems come from dumping too many blocks onto a 
queue all at once. In this issue we have uncovered the following problems:
 # Blocks are placed on the queue disk by disk for a datanode, and the queue is 
basically a FIFO, limiting decommission speed of that node to the speed of 1 
disk. It is hard to break away from that order, as our minimum unit of work 
under a lock is probably a single DN storage.
 # We also process each DN one at a time (or an operator may decommission one 
node and then another shortly after), so the first node to be decommissioned 
tends to decommission faster than the others, and the process does not use all 
of the cluster resources it could.
 # I suspect that sometimes blocks get skipped when processing the replication 
queue if there are no replication streams available, and it can take a very 
long time for it to cycle back around to the start if there are millions of 
blocks to process. This could result in some nodes failing to complete 
decommission due to only a few blocks that were skipped previously.

It is also worth noting that while the under-replicated blocks are placed onto 
the replication queue, the list of them is also retained in the decommission 
monitor, and they are all checked periodically to see if they have replicated 
yet. This means the pending blocks will be checked many times over the hours 
or days the decommission is running, which is also a waste of resources.

I think we could change the logic, so that when a node (or nodes) are 
decommissioned, we:
 # Process the blocks on the node and store them in the decommission monitor, 
but do not yet add them to the replication queue.
 # In the decommission monitor, we should store the blocks in a structure that 
allows them to be retrieved efficiently in a random order. Perhaps a hashmap, 
but there may be something better.
 # This means that the blocks from each disk and each node will be naturally in 
this structure in a random order.
 # Every X seconds, check the size of the pending replication queue. If it is 
under some threshold, take Y blocks from the 'waiting to decommission block 
list', check if they are under-replicated, and if so, add them to the 
replication queue and store them in a new pending list in the decommission 
monitor.
 # Periodically check only the blocks in the pending list to see if they have 
been replicated, and check if the replication queue is under some threshold, 
then repeat the process.

This way we avoid flooding the replication queue with millions of blocks and 
release them onto it in a more controlled fashion.
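
To make the idea concrete, here is a small illustrative sketch of that 
throttled release loop; the block type, the queue, the threshold check and 
isUnderReplicated() are placeholders assumed for illustration, not the actual 
decommission monitor or replication queue code:

{code:java}
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ThreadLocalRandom;

// Toy sketch of the proposal above; not the real decommission monitor.
class ThrottledDecommissionSketch {
  private final List<String> waitingBlocks = new ArrayList<>(); // not yet queued
  private final Set<String> pendingBlocks = new HashSet<>();    // queued, awaiting replication

  /** Would be called every X seconds by the monitor thread. */
  void tick(Deque<String> replicationQueue, int queueThreshold, int releaseCount) {
    // 1. Drop pending blocks that have finished replicating.
    pendingBlocks.removeIf(b -> !isUnderReplicated(b));
    // 2. Only top the queue up once it has drained below the threshold.
    if (replicationQueue.size() >= queueThreshold) {
      return;
    }
    // 3. Release Y blocks in a random order so no single disk or node dominates.
    for (int i = 0; i < releaseCount && !waitingBlocks.isEmpty(); i++) {
      int idx = ThreadLocalRandom.current().nextInt(waitingBlocks.size());
      String block = waitingBlocks.remove(idx);
      if (isUnderReplicated(block)) {
        replicationQueue.add(block);
        pendingBlocks.add(block);
      }
    }
  }

  private boolean isUnderReplicated(String block) {
    return true; // placeholder for the real replication check
  }
}
{code}

Tuning the tick interval (X), the release count (Y) and the queue threshold 
would control how aggressively decommission competes with other replication 
work.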

The blocks should be in a random order too.

If a node failure happens, the replication queue will fill up with its blocks 
and decommission will temporarily stall, giving priority to recovering the 
blocks from the failed node, which is probably desirable.

 

> Do Not Remove Blocks Sequentially During Decommission 
> --
>
> Key: HDFS-13157
> URL: https://issues.apache.org/jira/browse/HDFS-13157
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode, namenode
>Affects Versions: 3.0.0
>Reporter: David Mollitor
>Assignee: David Mollitor
>Priority: Major
> Attachments: HDFS-13157.1.patch
>
>
> From what I understand of [DataNode 
> decommissioning|https://github.com/apache/hadoop/blob/42a1c98597e6dba2e371510a6b2b6b1fb94e4090/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeAdminManager.java]
>  it appears that all the blocks are scheduled for removal _in order_. I'm 
> not 100% sure what the ordering is exactly, but I think it loops through each 
> data volume and schedules each block to be replicated elsewhere. The net 
> effect is that during a decommission, all of the DataNode transfer threads 
> slam on a single volume until it is cleaned out, at which point they all 
> slam on the next volume, etc.
> Please randomize the block list so that there is a more even distribution 
> across all volumes when decommissioning a node.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-13157) Do Not Remove Blocks Sequentially During Decommission

2019-09-09 Thread Stephen O'Donnell (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-13157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16925523#comment-16925523
 ] 

Stephen O'Donnell commented on HDFS-13157:
--

> How is it handled, iterating through each DataNode, that a block is scheduled 
> to be replicated onto a DataNode that will be decommissioned further down in 
> the list?

The block manager takes care of this when allocating a new block target. Nodes 
that are in a decommissioning state will not be considered as a new target.

In the current decommissioning implementation, the nodes selected for 
decommissioning are processed in a very conservative way: for nodes with more 
than 500K blocks (dfs.namenode.decommission.blocks.per.interval), it will 
process one node and then sleep for 30 seconds before processing the next 
one. I believe this is to prevent locking the Namenode too often in close 
succession.

When processing a node, we really need to hold a lock while processing some 
unit of work. The work in the datanode is split into its storage volumes, and 
we use an iterator to process all the blocks on each storage. If you drop the 
lock part-way through that iterator, then a block report or file modification 
in HDFS can change the contents of the storage, and the iterator will get a 
ConcurrentModificationException. Therefore interleaving many DNs for 
processing at the same time is tricky. Each one needs an exclusive lock and 
they will all be contending for it. If we drop and re-take the lock for each 
block, we will need to bookmark the iterator and handle 
ConcurrentModificationException, possibly frequently.
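
For illustration, a minimal sketch of the "one lock hold per storage" pattern 
described above, using a plain ReentrantLock and String placeholders rather 
than the real FSNamesystem lock and DatanodeStorageInfo/BlockInfo types:

{code:java}
import java.util.List;
import java.util.concurrent.locks.ReentrantLock;

// Illustrative only; stand-ins for the namesystem lock and block/storage types.
class PerStorageLockSketch {
  private final ReentrantLock nnLock = new ReentrantLock();

  void scanNode(List<List<String>> storages) {
    for (List<String> storage : storages) {
      nnLock.lock();
      try {
        // A whole storage is iterated under a single lock hold, so the iterator
        // cannot observe concurrent changes from block reports or file edits.
        for (String block : storage) {
          scheduleIfUnderReplicated(block);
        }
      } finally {
        nnLock.unlock(); // dropping the lock between storages bounds the hold time
      }
    }
  }

  private void scheduleIfUnderReplicated(String block) {
    // placeholder for adding the block to the replication queue
  }
}
{code}

Interleaving multiple DNs would mean many such loops contending for the same 
lock, which is exactly the difficulty described above.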

There is also no guarantee a user would not decommission node 1, then 10 
minutes later decommission node 2 and so on, and the suggested strategy would 
not help with that.

I still believe the simplest fix to this issue is to change the implementation 
of the pending replication queue to process it in a random order rather than 
FIFO. That does not help with nodes which have had some blocks skipped on the 
first pass and need to be processed a second time, but we may be able to solve 
that by retrying them a few times, as you also suggested.

> Do Not Remove Blocks Sequentially During Decommission 
> --
>
> Key: HDFS-13157
> URL: https://issues.apache.org/jira/browse/HDFS-13157
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode, namenode
>Affects Versions: 3.0.0
>Reporter: David Mollitor
>Assignee: David Mollitor
>Priority: Major
> Attachments: HDFS-13157.1.patch
>
>
> From what I understand of [DataNode 
> decommissioning|https://github.com/apache/hadoop/blob/42a1c98597e6dba2e371510a6b2b6b1fb94e4090/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeAdminManager.java]
>  it appears that all the blocks are scheduled for removal _in order_. I'm 
> not 100% sure what the ordering is exactly, but I think it loops through each 
> data volume and schedules each block to be replicated elsewhere. The net 
> effect is that during a decommission, all of the DataNode transfer threads 
> slam on a single volume until it is cleaned out, at which point they all 
> slam on the next volume, etc.
> Please randomize the block list so that there is a more even distribution 
> across all volumes when decommissioning a node.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-13157) Do Not Remove Blocks Sequentially During Decommission

2019-09-08 Thread David Mollitor (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-13157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16925176#comment-16925176
 ] 

David Mollitor commented on HDFS-13157:
---

One other thing here that I don't quite understand:

How is it handled, when iterating through each DataNode, that a block is 
scheduled to be replicated onto a DataNode that will be decommissioned further 
down in the list?  i.e. Node A replicates to Node B, then Node B is 
decommissioned, so that same block must now be replicated onto Node C, and so on...

> Do Not Remove Blocks Sequentially During Decommission 
> --
>
> Key: HDFS-13157
> URL: https://issues.apache.org/jira/browse/HDFS-13157
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode, namenode
>Affects Versions: 3.0.0
>Reporter: David Mollitor
>Assignee: David Mollitor
>Priority: Major
> Attachments: HDFS-13157.1.patch
>
>
> From what I understand of [DataNode 
> decommissioning|https://github.com/apache/hadoop/blob/42a1c98597e6dba2e371510a6b2b6b1fb94e4090/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeAdminManager.java]
>  it appears that all the blocks are scheduled for removal _in order_. I'm 
> not 100% sure what the ordering is exactly, but I think it loops through each 
> data volume and schedules each block to be replicated elsewhere. The net 
> effect is that during a decommission, all of the DataNode transfer threads 
> slam on a single volume until it is cleaned out, at which point they all 
> slam on the next volume, etc.
> Please randomize the block list so that there is a more even distribution 
> across all volumes when decommissioning a node.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-13157) Do Not Remove Blocks Sequentially During Decommission

2019-09-08 Thread David Mollitor (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-13157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16925165#comment-16925165
 ] 

David Mollitor commented on HDFS-13157:
---

Thank you all for the great investigatory work.

One question I have is in regards to:

bq. dropped until all other blocks have been tried and the iterator cycles 
round.

On the second run through the Iterator, shouldn't there be many fewer blocks in 
the list since many were successfully replicated away?


Regardless, I would like to propose a few ideas for addressing these concerns.

# Update the Iterator to rotate over the volumes; it is very common that an 
Iterator does not guarantee any particular order (see the sketch after this list)
# With the lock held, process each DataNode as a task and process each task on 
its own thread.  In this way the duration the lock is held can be decreased, 
and the replication requests will be interwoven across all the relevant 
DataNodes in the queue instead of loading the requests DN by DN in a serial fashion.
# Update the documentation to include information regarding how many blocks 
can be replicated per day, etc.
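
A minimal sketch of idea 1: an iterator that rotates across per-volume block 
lists instead of draining one volume before starting the next. This is only an 
illustration of the concept, not the actual DatanodeDescriptor BlockIterator:

{code:java}
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Conceptual round-robin iterator over per-volume block lists.
class RoundRobinBlockIterator<T> implements Iterator<T> {
  private final Deque<Iterator<T>> volumes = new ArrayDeque<>();

  RoundRobinBlockIterator(List<? extends Iterable<T>> perVolumeBlocks) {
    for (Iterable<T> volume : perVolumeBlocks) {
      Iterator<T> it = volume.iterator();
      if (it.hasNext()) {
        volumes.add(it);
      }
    }
  }

  @Override
  public boolean hasNext() {
    return !volumes.isEmpty();
  }

  @Override
  public T next() {
    if (volumes.isEmpty()) {
      throw new NoSuchElementException();
    }
    Iterator<T> current = volumes.poll(); // take the next volume in rotation
    T block = current.next();
    if (current.hasNext()) {
      volumes.add(current);               // rotate it to the back of the queue
    }
    return block;
  }
}
{code}

With this ordering, consecutive replication requests land on different volumes, 
which is the even spread the issue description asks for.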

> Do Not Remove Blocks Sequentially During Decommission 
> --
>
> Key: HDFS-13157
> URL: https://issues.apache.org/jira/browse/HDFS-13157
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode, namenode
>Affects Versions: 3.0.0
>Reporter: David Mollitor
>Assignee: David Mollitor
>Priority: Major
> Attachments: HDFS-13157.1.patch
>
>
> From what I understand of [DataNode 
> decommissioning|https://github.com/apache/hadoop/blob/42a1c98597e6dba2e371510a6b2b6b1fb94e4090/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeAdminManager.java]
>  it appears that all the blocks are scheduled for removal _in order_. I'm 
> not 100% sure what the ordering is exactly, but I think it loops through each 
> data volume and schedules each block to be replicated elsewhere. The net 
> effect is that during a decommission, all of the DataNode transfer threads 
> slam on a single volume until it is cleaned out, at which point they all 
> slam on the next volume, etc.
> Please randomize the block list so that there is a more even distribution 
> across all volumes when decommissioning a node.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-13157) Do Not Remove Blocks Sequentially During Decommission

2019-09-03 Thread Stephen O'Donnell (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-13157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16921331#comment-16921331
 ] 

Stephen O'Donnell commented on HDFS-13157:
--

{quote}
# Add a configuration, which makes NN to release the lock every 1(configurable) blocks.
{quote}

There was some discussion related to this in HDFS-10477, and they decided to 
drop the lock after processing each storage. The reason is that the iterator 
for the storage could get a ConcurrentModificationException if its contents 
change while the lock is dropped and retaken. Locking at the storage level is 
probably a good middle ground between how it works currently and locking on a 
block-count threshold.

Thinking about the problem of replicating older blocks first: we currently 
have several replication queues, and blocks with only 1 replica should go into 
the highest-priority queue. That means other blocks (with only 2 replicas) and 
decommissioning blocks are in the 'normal' queue. Looking at how that queue is 
currently processed, it begins at the start and:
 # Gets 2 * live_nodes blocks
 # Attempts to schedule them for replication based on max-streams limits
 # Any that are not scheduled are simply dropped until all other blocks have 
been tried and the iterator cycles round.

Therefore even in the current implementation some of the blocks can get left 
behind for some time.

This does seem to be a tricky problem to get correct, as there are quite a few 
edge cases and scenarios to consider.

> Do Not Remove Blocks Sequentially During Decommission 
> --
>
> Key: HDFS-13157
> URL: https://issues.apache.org/jira/browse/HDFS-13157
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode, namenode
>Affects Versions: 3.0.0
>Reporter: David Mollitor
>Assignee: David Mollitor
>Priority: Major
> Attachments: HDFS-13157.1.patch
>
>
> From what I understand of [DataNode 
> decommissioning|https://github.com/apache/hadoop/blob/42a1c98597e6dba2e371510a6b2b6b1fb94e4090/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeAdminManager.java]
>  it appears that all the blocks are scheduled for removal _in order_. I'm 
> not 100% sure what the ordering is exactly, but I think it loops through each 
> data volume and schedules each block to be replicated elsewhere. The net 
> effect is that during a decommission, all of the DataNode transfer threads 
> slam on a single volume until it is cleaned out, at which point they all 
> slam on the next volume, etc.
> Please randomize the block list so that there is a more even distribution 
> across all volumes when decommissioning a node.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-13157) Do Not Remove Blocks Sequentially During Decommission

2019-09-03 Thread Chen Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-13157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16921211#comment-16921211
 ] 

Chen Zhang commented on HDFS-13157:
---

Thanks [~sodonnell] for your detailed analysis. Since the comments are very 
long, I'll try to outline the conclusions we have reached until now:
 # When decommissioning one DN, the blocks are added to the replication queue 
one disk after another.
 # When scheduling replication work, the redundancyMonitor gets the blocks from 
the highest-priority queue one by one, so for each DN the replication work 
concentrates on 1 disk.
 # In some cases, the decommission may effectively happen on only one DN (e.g. 
the first liveNodes*2 blocks are all from one DN, and the blocks have no 
overlap with other decommissioning nodes).
 ## *My thoughts here*: actually, in most cases all DNs will be decommissioning 
in progress at the same time, but the 1st node will be much faster, which is 
also not what we want.
 # [~sodonnell] observed that the time the NN lock is held when processing a 
node for decommission may be very long.

 

Actually, we've encountered all of these problems on our online cluster; I'd 
like to share our solutions here:
 # Add a new replication queue implementation that randomizes the order in 
which blocks are retrieved from it.
 # Add a configuration which makes the NN release the lock every 
1(configurable) blocks.

*I think randomizing the replication queue is better than randomizing the 
blockIterator,* because randomizing the blockIterator won't resolve issue 3 
mentioned above and makes the optimization for issue 4 harder.

*But randomizing the replication queue has some other bad impacts that we 
should consider here*: usually the blocks at the head should be scheduled 
before the blocks at the tail, because the longer a block waits for 
replication, the higher the probability of data loss; randomizing the order 
makes that impossible.
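
For clarity, a toy sketch of what a "randomized replication queue" could look 
like (a uniform swap-remove over a backing list); this only illustrates the 
idea and is not the implementation referred to above:

{code:java}
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

// Toy randomized queue: elements come back out in a uniformly random order.
class RandomizedQueueSketch<T> {
  private final List<T> items = new ArrayList<>();

  void add(T item) {
    items.add(item);
  }

  /** Remove and return a random element, or null when the queue is empty. */
  T poll() {
    if (items.isEmpty()) {
      return null;
    }
    int idx = ThreadLocalRandom.current().nextInt(items.size());
    int last = items.size() - 1;
    T picked = items.get(idx);
    items.set(idx, items.get(last)); // move the last element into the gap
    items.remove(last);              // then shrink the list in O(1)
    return picked;
  }
}
{code}

The trade-off described above is visible here: because removal order is 
uniform, a block that entered the queue early has no scheduling priority over 
one that entered late.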

> Do Not Remove Blocks Sequentially During Decommission 
> --
>
> Key: HDFS-13157
> URL: https://issues.apache.org/jira/browse/HDFS-13157
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode, namenode
>Affects Versions: 3.0.0
>Reporter: David Mollitor
>Assignee: David Mollitor
>Priority: Major
> Attachments: HDFS-13157.1.patch
>
>
> From what I understand of [DataNode 
> decommissioning|https://github.com/apache/hadoop/blob/42a1c98597e6dba2e371510a6b2b6b1fb94e4090/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeAdminManager.java]
>  it appears that all the blocks are scheduled for removal _in order_. I'm 
> not 100% sure what the ordering is exactly, but I think it loops through each 
> data volume and schedules each block to be replicated elsewhere. The net 
> effect is that during a decommission, all of the DataNode transfer threads 
> slam on a single volume until it is cleaned out, at which point they all 
> slam on the next volume, etc.
> Please randomize the block list so that there is a more even distribution 
> across all volumes when decommissioning a node.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-13157) Do Not Remove Blocks Sequentially During Decommission

2019-09-02 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-13157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16921065#comment-16921065
 ] 

Hadoop QA commented on HDFS-13157:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
41s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} dupname {color} | {color:green}  0m  
0s{color} | {color:green} No case conflicting files found. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 1 new or modified test 
files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 17m 
34s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m  
2s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
49s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m  
9s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
13m 13s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
58s{color} | {color:green} trunk passed {color} |
| {color:blue}0{color} | {color:blue} spotbugs {color} | {color:blue}  2m 
39s{color} | {color:blue} Used deprecated FindBugs config; considering 
switching to SpotBugs. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  2m 
37s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  1m 
 4s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
56s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
56s{color} | {color:green} the patch passed {color} |
| {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange}  
0m 43s{color} | {color:orange} hadoop-hdfs-project/hadoop-hdfs: The patch 
generated 2 new + 162 unchanged - 1 fixed = 164 total (was 163) {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m  
1s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
12m  9s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
54s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  2m 
42s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 89m 49s{color} 
| {color:red} hadoop-hdfs in the patch failed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
41s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black}149m  9s{color} | 
{color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | 
hadoop.hdfs.server.blockmanagement.TestPendingInvalidateBlock |
|   | hadoop.hdfs.server.balancer.TestBalancer |
|   | hadoop.hdfs.TestGetBlocks |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=19.03.1 Server=19.03.1 base: 
https://builds.apache.org/job/hadoop-multibranch/job/PR-1391/1/artifact/out/Dockerfile
 |
| GITHUB PR | https://github.com/apache/hadoop/pull/1391 |
| JIRA Issue | HDFS-13157 |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite 
unit shadedclient findbugs checkstyle |
| uname | Linux 4b8395cfeab6 4.15.0-58-generic #64-Ubuntu SMP Tue Aug 6 
11:12:41 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | personality/hadoop.sh |
| git revision | trunk / 915cbc9 |
| Default Java | 1.8.0_222 |
| checkstyle | 

[jira] [Commented] (HDFS-13157) Do Not Remove Blocks Sequentially During Decommission

2019-09-02 Thread Stephen O'Donnell (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-13157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16920783#comment-16920783
 ] 

Stephen O'Donnell commented on HDFS-13157:
--

I tested my theory that this problem can also result in only 1 node making 
decommission progress when several are decommissioned at the same time. Using 
a simulated cluster with two carefully picked nodes, such that node 1 and node 
2 do not host any of the same blocks, I can see that the first to start 
decommission makes progress while the other does not make any. In a real 
cluster there is likely to be some overlap in the blocks between the two 
nodes, so both will make some progress, but this is only because the 
replication monitor notices 2 new replicas are needed for the block and 
schedules two new copies at the same time.

Therefore this problem is worse than just concentrating decommission on a 
single disk: it also means decommission does not really work on more than 1 
node at a time.

I am also concerned about the time the NN lock is held when processing a node 
for decommission. In tests on my laptop, it takes about 300ms for a node with 
340K blocks and 660ms for 1M blocks. Scaling up to 5M blocks, this could hold 
the lock for about 3 seconds per node. There is a delay between each node, but 
it is still not ideal to block the NN for that long.

Randomizing the iterator in the way suggested here would prevent us from making 
a later change to drop and retake the NN lock per storage on the DN to improve 
the locking time.

This makes me think the solution to this problem is not to randomize the blocks 
from one node onto the replication queue, but instead to randomize the order 
the replication queue is processed somehow.

> Do Not Remove Blocks Sequentially During Decommission 
> --
>
> Key: HDFS-13157
> URL: https://issues.apache.org/jira/browse/HDFS-13157
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode, namenode
>Affects Versions: 3.0.0
>Reporter: David Mollitor
>Assignee: David Mollitor
>Priority: Major
> Attachments: HDFS-13157.1.patch
>
>
> From what I understand of [DataNode 
> decommissioning|https://github.com/apache/hadoop/blob/42a1c98597e6dba2e371510a6b2b6b1fb94e4090/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeAdminManager.java]
>  it appears that all the blocks are scheduled for removal _in order_. I'm 
> not 100% sure what the ordering is exactly, but I think it loops through each 
> data volume and schedules each block to be replicated elsewhere. The net 
> effect is that during a decommission, all of the DataNode transfer threads 
> slam on a single volume until it is cleaned out, at which point they all 
> slam on the next volume, etc.
> Please randomize the block list so that there is a more even distribution 
> across all volumes when decommissioning a node.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-13157) Do Not Remove Blocks Sequentially During Decommission

2019-09-02 Thread Stephen O'Donnell (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-13157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16920728#comment-16920728
 ] 

Stephen O'Donnell commented on HDFS-13157:
--

I believe BlockManager.getBlocksWithLocations() is used by the HDFS balancer, 
as it is supposed to move 'random blocks' from a given DN to another. However 
that code path has some problems I have been meaning to look into, but so far 
have not had time.

> Do Not Remove Blocks Sequentially During Decommission 
> --
>
> Key: HDFS-13157
> URL: https://issues.apache.org/jira/browse/HDFS-13157
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode, namenode
>Affects Versions: 3.0.0
>Reporter: David Mollitor
>Assignee: David Mollitor
>Priority: Major
> Attachments: HDFS-13157.1.patch
>
>
> From what I understand of [DataNode 
> decommissioning|https://github.com/apache/hadoop/blob/42a1c98597e6dba2e371510a6b2b6b1fb94e4090/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeAdminManager.java]
>  it appears that all the blocks are scheduled for removal _in order_. I'm 
> not 100% sure what the ordering is exactly, but I think it loops through each 
> data volume and schedules each block to be replicated elsewhere. The net 
> effect is that during a decommission, all of the DataNode transfer threads 
> slam on a single volume until it is cleaned out, at which point they all 
> slam on the next volume, etc.
> Please randomize the block list so that there is a more even distribution 
> across all volumes when decommissioning a node.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-13157) Do Not Remove Blocks Sequentially During Decommission

2019-09-01 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-13157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16920633#comment-16920633
 ] 

Hadoop QA commented on HDFS-13157:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
26s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 1 new or modified test 
files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 18m 
52s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m  
8s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
50s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m 
17s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
14m 44s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  2m 
19s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
58s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  1m 
 9s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m  
3s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  1m  
3s{color} | {color:green} the patch passed {color} |
| {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange}  
0m 47s{color} | {color:orange} hadoop-hdfs-project/hadoop-hdfs: The patch 
generated 2 new + 162 unchanged - 1 fixed = 164 total (was 163) {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m 
14s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
14m 45s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  2m 
23s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
58s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:red}-1{color} | {color:red} unit {color} | {color:red}101m  0s{color} 
| {color:red} hadoop-hdfs in the patch failed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
39s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black}164m 26s{color} | 
{color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | hadoop.hdfs.server.balancer.TestBalancer |
|   | hadoop.hdfs.server.namenode.TestDefaultBlockPlacementPolicy |
|   | hadoop.hdfs.server.namenode.TestDecommissioningStatus |
|   | hadoop.hdfs.TestErasureCodingPolicyWithSnapshot |
|   | hadoop.hdfs.server.namenode.TestNamenodeRetryCache |
|   | hadoop.hdfs.TestDecommission |
|   | hadoop.hdfs.server.namenode.TestFSNamesystemMBean |
|   | hadoop.hdfs.server.namenode.TestAddStripedBlocks |
|   | hadoop.hdfs.server.namenode.ha.TestBootstrapStandbyWithQJM |
|   | hadoop.hdfs.TestClientProtocolForPipelineRecovery |
|   | hadoop.hdfs.server.namenode.TestPersistentStoragePolicySatisfier |
|   | hadoop.hdfs.server.blockmanagement.TestUnderReplicatedBlocks |
|   | hadoop.hdfs.TestReconstructStripedFileWithRandomECPolicy |
|   | hadoop.hdfs.TestStripedFileAppend |
|   | hadoop.hdfs.server.namenode.TestCacheDirectives |
|   | hadoop.hdfs.server.namenode.TestStoragePolicySatisfierWithHA |
|   | hadoop.hdfs.TestGetBlocks |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=19.03.1 Server=19.03.1 Image:yetus/hadoop:bdbca0e53b4 |
| JIRA Issue | HDFS-13157 |
| JIRA Patch URL | 

[jira] [Commented] (HDFS-13157) Do Not Remove Blocks Sequentially During Decommission

2019-09-01 Thread David Mollitor (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-13157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16920549#comment-16920549
 ] 

David Mollitor commented on HDFS-13157:
---

OK.

 

I dropped the first patch.  I'm in a bit of a rush and didn't get a chance to 
run all the unit tests locally, but wanted to get some movement on this 
nonetheless.

 

As is often the case with this project, I had to touch more than I would have 
wanted.  There is a weird dependency on the Iterator in 
[BlockManager.java|https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockManager.java#L1589]

I really don't understand why {{BlockManager}} is trying to take a random 
sample here.  Maybe someone else knows why?

> Do Not Remove Blocks Sequentially During Decommission 
> --
>
> Key: HDFS-13157
> URL: https://issues.apache.org/jira/browse/HDFS-13157
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode, namenode
>Affects Versions: 3.0.0
>Reporter: David Mollitor
>Assignee: David Mollitor
>Priority: Major
> Attachments: HDFS-13157.1.patch
>
>
> From what I understand of [DataNode 
> decommissioning|https://github.com/apache/hadoop/blob/42a1c98597e6dba2e371510a6b2b6b1fb94e4090/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeAdminManager.java]
>  it appears that all the blocks are scheduled for removal _in order_. I'm 
> not 100% sure what the ordering is exactly, but I think it loops through each 
> data volume and schedules each block to be replicated elsewhere. The net 
> effect is that during a decommission, all of the DataNode transfer threads 
> slam on a single volume until it is cleaned out, at which point they all 
> slam on the next volume, etc.
> Please randomize the block list so that there is a more even distribution 
> across all volumes when decommissioning a node.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-13157) Do Not Remove Blocks Sequentially During Decommission

2019-09-01 Thread He Xiaoqiao (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-13157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16920441#comment-16920441
 ] 

He Xiaoqiao commented on HDFS-13157:


Thanks [~belugabehr] for the great work and detailed analysis. I believe this 
issue is even more obvious in a Federation setup. +1 for the deep dig by 
[~sodonnell]. We could tune the parameters [blocksReplWorkMultiplier, 
maxReplicationStreams, maxReplicationStreamsHardLimit] per namespace, but the 
common operation to decommission a node is triggered from the shell for all 
namespaces at the same time, and the different namenodes then also send 
replication commands at the same time if this node reports to different 
namespaces. The load on the decommission-in-progress node is then out of 
control. I have met both network and single-disk I/O bottlenecks.
I believe the current parameters are enough to solve the network bottleneck.
For the single-disk I/O bottleneck, +1 for updating 
DatanodeDescriptor#BlockIterator to iterate blocks from alternating disks 
rather than iterating blocks from one disk at a time.
Another thought: we should not dispatch write operations to a 
decommission-in-progress node, and we should decrease its read priority to the 
lowest, just as for a decommissioned node; then whether or not the 
decommissioning node is under high load does not affect clients or the cluster 
at all.
This discussion does not cover RAID or the scenarios [~zhangchen] mentioned above.

> Do Not Remove Blocks Sequentially During Decommission 
> --
>
> Key: HDFS-13157
> URL: https://issues.apache.org/jira/browse/HDFS-13157
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode, namenode
>Affects Versions: 3.0.0
>Reporter: David Mollitor
>Assignee: David Mollitor
>Priority: Major
>
> From what I understand of [DataNode 
> decommissioning|https://github.com/apache/hadoop/blob/42a1c98597e6dba2e371510a6b2b6b1fb94e4090/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeAdminManager.java]
>  it appears that all the blocks are scheduled for removal _in order._. I'm 
> not 100% sure what the ordering is exactly, but I think it loops through each 
> data volume and schedules each block to be replicated elsewhere. The net 
> affect is that during a decommission, all of the DataNode transfer threads 
> slam on a single volume until it is cleaned out. At which point, they all 
> slam on the next volume, etc.
> Please randomize the block list so that there is a more even distribution 
> across all volumes when decommissioning a node.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-13157) Do Not Remove Blocks Sequentially During Decommission

2019-09-01 Thread Stephen O'Donnell (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-13157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16920396#comment-16920396
 ] 

Stephen O'Donnell commented on HDFS-13157:
--

The way I believe the redundancyMonitor works is as follows:

It picks the next live_nodes * work_multiplier (default 2) blocks from the 
needs_replication queue in the order they were added to the queue.

Then it looks at the nodes which host the block and randomly picks one of them 
that does not already have more than maxReplicationStreams allocated to 
perform the replication. I see the comment in chooseSourceDatanodes() that 
states it prefers decommissioning nodes, and I think it implements this by 
allocating max_replication_streams blocks to IN_SERVICE nodes but 
max_replication_stream_Hard_Limit to decommissioning nodes, i.e. the 
decommissioning nodes normally have a higher limit.

Imagine we have a maxReplicationStreams of 5 (a very low setting; many clusters 
will have this set to 50 or more), a maxReplicationStreamsHardLimit of 10, and 
200 live nodes.

This means the redundancy monitor will pick 200 * 2 (work multiplier default) = 
400 blocks to process on each iteration.

It will then randomly select 1 of the datanodes as a source for each block, 
meaning the decommissioning node will on average get allocated 10 (the hard 
limit) out of the first 30 blocks. At that point it will have reached its 
stream limit, and the remaining 370 blocks should be assigned to other nodes 
(assuming they have capacity).

Therefore for replication factor 3 blocks, the decommissioning node will likely 
replicate much less than a third of its own blocks, but all those blocks will 
likely be on one disk.
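
As a minimal sketch of the stream-limit check described above (the method and 
parameter names here are illustrative, not the actual 
BlockManager/chooseSourceDatanodes API, which also considers corrupt replicas, 
priorities, node load and more):

{code:java}
final class SourceLimitSketch {
  static boolean canScheduleOn(boolean decommissioning,
                               int streamsAlreadyScheduled,
                               int maxReplicationStreams,
                               int maxReplicationStreamsHardLimit) {
    // Decommissioning nodes get the higher hard limit; in-service nodes are
    // capped at the normal maxReplicationStreams value.
    int limit = decommissioning ? maxReplicationStreamsHardLimit
                                : maxReplicationStreams;
    return streamsAlreadyScheduled < limit;
  }

  public static void main(String[] args) {
    // Numbers from the example above: maxReplicationStreams=5, hard limit=10.
    System.out.println(canScheduleOn(true, 9, 5, 10));  // true: under the hard limit
    System.out.println(canScheduleOn(false, 9, 5, 10)); // false: over the normal limit
  }
}
{code}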

My reading of the logic therefore also suggests that if you decommission 
several nodes, then it will take the redundancy monitor some time to consider 
the blocks from the second node, as it will work through the list of blocks in 
order. However there is a good chance those other decommissioning nodes will be 
participating in replicating blocks from the first node, so they are unlikely 
to be idle.

However, the scenario [~zhangchen] mentioned is an interesting one, where blocks 
have replication factor 1, as the decommissioning node must then be the source. 
In that case I think the redundancy monitor would process 400 blocks, assign 10 
of them and skip the next 390 on each iteration until it hits the second node 
being decommissioned, repeating this until it cycles back to the start of the 
under-replicated list again. So if you are decommissioning nodes with 
replication factor 1 blocks, not only would it use only 1 disk, but it would 
only work on one decommissioning node at a time. I have not tested this, so 
there may be some logic I have not understood fully that handles this sort of 
case.
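
Purely to make the arithmetic above concrete, here is a toy, self-contained 
simulation (my own sketch, not HDFS code), assuming 200 live nodes, a work 
multiplier of 2, a hard limit of 10 streams on the decommissioning source and 
replication factor 3:
{code:java}
import java.util.Random;

public class DecommissionSourceSimulation {
  public static void main(String[] args) {
    // Assumed settings taken from the discussion above.
    final int blocksPerIteration = 200 * 2; // live_nodes * work_multiplier
    final int hardLimit = 10;               // streams allowed on the decommissioning source
    final Random random = new Random();

    int decommissioningStreams = 0;
    int blocksConsidered = 0;

    for (int block = 0; block < blocksPerIteration; block++) {
      if (decommissioningStreams >= hardLimit) {
        break; // further blocks must use one of the other two replicas as the source
      }
      blocksConsidered++;
      // With 3 replicas, the decommissioning node is picked as the source ~1/3 of the time.
      if (random.nextInt(3) == 0) {
        decommissioningStreams++;
      }
    }
    System.out.println("Hard limit of " + hardLimit + " reached after roughly "
        + blocksConsidered + " of " + blocksPerIteration + " blocks");
  }
}
{code}
On most runs this prints a number close to 30, in line with the rough estimate 
above.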

> Do Not Remove Blocks Sequentially During Decommission 
> --
>
> Key: HDFS-13157
> URL: https://issues.apache.org/jira/browse/HDFS-13157
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode, namenode
>Affects Versions: 3.0.0
>Reporter: David Mollitor
>Assignee: David Mollitor
>Priority: Major
>
> From what I understand of [DataNode 
> decommissioning|https://github.com/apache/hadoop/blob/42a1c98597e6dba2e371510a6b2b6b1fb94e4090/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeAdminManager.java]
>  it appears that all the blocks are scheduled for removal _in order_. I'm 
> not 100% sure what the ordering is exactly, but I think it loops through each 
> data volume and schedules each block to be replicated elsewhere. The net 
> effect is that during a decommission, all of the DataNode transfer threads 
> slam on a single volume until it is cleaned out. At which point, they all 
> slam on the next volume, etc.
> Please randomize the block list so that there is a more even distribution 
> across all volumes when decommissioning a node.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-13157) Do Not Remove Blocks Sequentially During Decommission

2019-09-01 Thread Chen Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-13157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16920359#comment-16920359
 ] 

Chen Zhang commented on HDFS-13157:
---

We observed the same problem on our production cluster about half a year ago. 
It is easy to repro the issue when decommissioning a node with warm data: we 
use HDFS Raid to convert that data to Erasure Coding, so every block has only 1 
replica and therefore only 1 source node from which to replicate the data.

We observed that disk I/O utilization rose to 100% one disk at a time on the 
decommissioning node, which made the decommission progress very slow.

> Do Not Remove Blocks Sequentially During Decommission 
> --
>
> Key: HDFS-13157
> URL: https://issues.apache.org/jira/browse/HDFS-13157
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode, namenode
>Affects Versions: 3.0.0
>Reporter: David Mollitor
>Assignee: David Mollitor
>Priority: Major
>
> From what I understand of [DataNode 
> decommissioning|https://github.com/apache/hadoop/blob/42a1c98597e6dba2e371510a6b2b6b1fb94e4090/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeAdminManager.java]
>  it appears that all the blocks are scheduled for removal _in order_. I'm 
> not 100% sure what the ordering is exactly, but I think it loops through each 
> data volume and schedules each block to be replicated elsewhere. The net 
> effect is that during a decommission, all of the DataNode transfer threads 
> slam on a single volume until it is cleaned out. At which point, they all 
> slam on the next volume, etc.
> Please randomize the block list so that there is a more even distribution 
> across all volumes when decommissioning a node.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-13157) Do Not Remove Blocks Sequentially During Decommission

2019-08-31 Thread Stephen O'Donnell (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-13157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16920137#comment-16920137
 ] 

Stephen O'Donnell commented on HDFS-13157:
--

Another thought - if this problem exists, where replication works predominantly 
off one decommissioning disk on the node, then if you decommission X nodes, 
does that mean the first node tends to decommission at a faster rate than the 
second due to the order the blocks are added to the replication queue? Ie, Node 
1's blocks are all added, then node 2's etc. I think we need to dig into how 
the replication queue is processed in some more detail to really understand how 
blocks are replicated after getting into that queue. There are several limits 
in play, such as max-streams and max-streams-hard-limit (which are per DN), and 
the work multiplier, which governs the total number of blocks that can have 
replication in-flight across the entire cluster.
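
For reference, a quick sketch of how those limits could be read from the 
configuration. The key names below reflect my understanding and the fallback 
values are placeholders, so please verify both against DFSConfigKeys:
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hdfs.HdfsConfiguration;

public class PrintReplicationLimits {
  public static void main(String[] args) {
    Configuration conf = new HdfsConfiguration();
    // Assumed key names; the defaults passed here are illustrative only.
    int maxStreams = conf.getInt("dfs.namenode.replication.max-streams", 2);
    int maxStreamsHardLimit =
        conf.getInt("dfs.namenode.replication.max-streams-hard-limit", 4);
    int workMultiplier =
        conf.getInt("dfs.namenode.replication.work.multiplier.per.iteration", 2);
    System.out.println("max-streams=" + maxStreams
        + " hard-limit=" + maxStreamsHardLimit
        + " work-multiplier=" + workMultiplier);
  }
}
{code}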

> Do Not Remove Blocks Sequentially During Decommission 
> --
>
> Key: HDFS-13157
> URL: https://issues.apache.org/jira/browse/HDFS-13157
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode, namenode
>Affects Versions: 3.0.0
>Reporter: David Mollitor
>Assignee: David Mollitor
>Priority: Major
>
> From what I understand of [DataNode 
> decommissioning|https://github.com/apache/hadoop/blob/42a1c98597e6dba2e371510a6b2b6b1fb94e4090/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeAdminManager.java]
>  it appears that all the blocks are scheduled for removal _in order_. I'm 
> not 100% sure what the ordering is exactly, but I think it loops through each 
> data volume and schedules each block to be replicated elsewhere. The net 
> effect is that during a decommission, all of the DataNode transfer threads 
> slam on a single volume until it is cleaned out. At which point, they all 
> slam on the next volume, etc.
> Please randomize the block list so that there is a more even distribution 
> across all volumes when decommissioning a node.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-13157) Do Not Remove Blocks Sequentially During Decommission

2019-08-31 Thread Stephen O'Donnell (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-13157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16920125#comment-16920125
 ] 

Stephen O'Donnell commented on HDFS-13157:
--

Thinking about this some more, the current logic looks like:
{code:java}
for each block on DN {
  if (not_sufficiently_replicated) {
add_to_replication_queue(block)
add_to_insufficientList(block)
  }
} {code}
If it is possible to drop and re-take the namenode lock for each datanode disk 
in a similar way to HDFS-10477, then I wonder if we could shuffle the order 
blocks are added to the replication_queue, rather than shuffle the order they 
are read from the datanode storage?

Eg, something like:
{code:java}
for each storage on DN {
  get_nn_lock()
  for each block on storage {
if (not_sufficiently_replicated) {
  add_to_insufficientList(block)
}
  }
  release_nn_lock()
}
add_to_replication_queue_in_random_order(insufficientList) {code}
However, it may not be that simple, as between lock acquisitions the state of 
some blocks may have changed. E.g. they could have been deleted or had their 
replica count changed, meaning they would need to be rechecked for sufficient 
replication, duplicating work already done.
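
A minimal, self-contained illustration of that shuffling idea, using strings as 
stand-ins for blocks (this is not the actual DatanodeAdminManager code):
{code:java}
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Queue;

public class ShuffleBeforeQueueing {
  public static void main(String[] args) {
    // Blocks as they would be discovered storage by storage: all of disk0, then disk1.
    List<String> insufficientList = new ArrayList<>(Arrays.asList(
        "disk0-blk1", "disk0-blk2", "disk0-blk3",
        "disk1-blk1", "disk1-blk2", "disk1-blk3"));

    // A single shuffle before queueing breaks the per-volume ordering.
    Collections.shuffle(insufficientList);

    Queue<String> replicationQueue = new ArrayDeque<>(insufficientList);
    System.out.println(replicationQueue);
  }
}
{code}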

> Do Not Remove Blocks Sequentially During Decommission 
> --
>
> Key: HDFS-13157
> URL: https://issues.apache.org/jira/browse/HDFS-13157
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode, namenode
>Affects Versions: 3.0.0
>Reporter: David Mollitor
>Assignee: David Mollitor
>Priority: Major
>
> From what I understand of [DataNode 
> decommissioning|https://github.com/apache/hadoop/blob/42a1c98597e6dba2e371510a6b2b6b1fb94e4090/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeAdminManager.java]
>  it appears that all the blocks are scheduled for removal _in order_. I'm 
> not 100% sure what the ordering is exactly, but I think it loops through each 
> data volume and schedules each block to be replicated elsewhere. The net 
> effect is that during a decommission, all of the DataNode transfer threads 
> slam on a single volume until it is cleaned out. At which point, they all 
> slam on the next volume, etc.
> Please randomize the block list so that there is a more even distribution 
> across all volumes when decommissioning a node.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-13157) Do Not Remove Blocks Sequentially During Decommission

2019-08-30 Thread Stephen O'Donnell (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-13157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16919628#comment-16919628
 ] 

Stephen O'Donnell commented on HDFS-13157:
--

This is a very interesting observation. The DatanodeAdminManager writes the 
blocks to be replicated into the replication queue, and then it is processed 
like a queue, in order, where work is then allocated to the datanodes. So it 
seems you are correct, in that the blocks will be replicated disk by disk.

{quote}

With that said, perhaps the decommissioning node should be exempted from 
replication altogether (for all blocks with replication >1) so that the load is 
spread randomly throughout all the DataNodes in the cluster and does not get 
concentrated on the decommissioning node.

{quote}

It is also worth keeping in mind that a node in decommissioning state will not 
receive any new writes, and reads are only directed to it as a last resort 
(i.e. no other replica is available). Therefore the load on a decommissioning 
node would normally be lower, so it should be able to replicate more. Also, 
each DN can only have a set number of in-flight transfers at once. So if the 
decommissioning node was not able to keep up, I think more of its blocks would 
get scheduled on other nodes as they clear their work queues more quickly, but 
I am not certain of that.

One observation I had when looking at decommissioning is that the NN write 
lock is held for the entire time it takes to process the blocks on a given DN 
on the first pass. This time can be several seconds on a node with a lot of 
blocks. HDFS-10477 observed the same behaviour for recommission and made a 
change to drop the write lock after processing each storage rather than holding 
it for all storages. I wonder if shuffling the iterators would make a change 
like that for starting the decommission impossible?

> Do Not Remove Blocks Sequentially During Decommission 
> --
>
> Key: HDFS-13157
> URL: https://issues.apache.org/jira/browse/HDFS-13157
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode, namenode
>Affects Versions: 3.0.0
>Reporter: David Mollitor
>Assignee: David Mollitor
>Priority: Major
>
> From what I understand of [DataNode 
> decommissioning|https://github.com/apache/hadoop/blob/42a1c98597e6dba2e371510a6b2b6b1fb94e4090/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeAdminManager.java]
>  it appears that all the blocks are scheduled for removal _in order_. I'm 
> not 100% sure what the ordering is exactly, but I think it loops through each 
> data volume and schedules each block to be replicated elsewhere. The net 
> effect is that during a decommission, all of the DataNode transfer threads 
> slam on a single volume until it is cleaned out. At which point, they all 
> slam on the next volume, etc.
> Please randomize the block list so that there is a more even distribution 
> across all volumes when decommissioning a node.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-13157) Do Not Remove Blocks Sequentially During Decommission

2019-08-30 Thread Wei-Chiu Chuang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-13157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16919609#comment-16919609
 ] 

Wei-Chiu Chuang commented on HDFS-13157:


Good work, [~belugabehr]! I think it makes sense to look at the datanode side 
during a decomm at the very least. Previously I thought NameNode would be the 
bottleneck for decomm, but I am probably wrong.

[~sodonnell] FYI since you've been looking at decomm lately.


Looking at the BlockManager implementation for re-replication 
(BlockManager#scheduleReconstruction()), it does prefer to replicate from 
datanodes in the DECOMMISSION_INPROGRESS state.
{code:java}
/**
   * Parse the data-nodes the block belongs to and choose a certain number
   * from them to be the recovery sources.
   *
   * We prefer nodes that are in DECOMMISSION_INPROGRESS state to other nodes
   * since the former do not have write traffic and hence are less busy.
   * We do not use already decommissioned nodes as a source, unless there is
   * no other choice.
   * Otherwise we randomly choose nodes among those that did not reach their
   * replication limits. However, if the recovery work is of the highest
   * priority and all nodes have reached their replication limits, we will
   * randomly choose the desired number of nodes despite the replication limit.
   *
   * In addition form a list of all nodes containing the block
   * and calculate its replication numbers.
   *
   * @param block Block for which a replication source is needed
   * @param containingNodes List to be populated with nodes found to contain
   *the given block
   * @param nodesContainingLiveReplicas List to be populated with nodes found
   *to contain live replicas of the given
   *block
   * @param numReplicas NumberReplicas instance to be initialized with the
   *counts of live, corrupt, excess, and decommissioned
   *replicas of the given block.
   * @param liveBlockIndices List to be populated with indices of healthy
   * blocks in a striped block group
   * @param priority integer representing replication priority of the given
   * block
   * @return the array of DatanodeDescriptor of the chosen nodes from which to
   * recover the given block
   */
@VisibleForTesting
  DatanodeDescriptor[] chooseSourceDatanodes(BlockInfo block,
  List<DatanodeDescriptor> containingNodes,
  List<DatanodeStorageInfo> nodesContainingLiveReplicas,
  NumberReplicas numReplicas,
  List<Byte> liveBlockIndices, int priority) {
{code}

> Do Not Remove Blocks Sequentially During Decommission 
> --
>
> Key: HDFS-13157
> URL: https://issues.apache.org/jira/browse/HDFS-13157
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode, namenode
>Affects Versions: 3.0.0
>Reporter: David Mollitor
>Assignee: David Mollitor
>Priority: Major
>
> From what I understand of [DataNode 
> decommissioning|https://github.com/apache/hadoop/blob/42a1c98597e6dba2e371510a6b2b6b1fb94e4090/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeAdminManager.java]
>  it appears that all the blocks are scheduled for removal _in order_. I'm 
> not 100% sure what the ordering is exactly, but I think it loops through each 
> data volume and schedules each block to be replicated elsewhere. The net 
> effect is that during a decommission, all of the DataNode transfer threads 
> slam on a single volume until it is cleaned out. At which point, they all 
> slam on the next volume, etc.
> Please randomize the block list so that there is a more even distribution 
> across all volumes when decommissioning a node.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-13157) Do Not Remove Blocks Sequentially During Decommission

2019-08-30 Thread David Mollitor (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-13157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16919546#comment-16919546
 ] 

David Mollitor commented on HDFS-13157:
---

I'm pretty sure that it is the case that the decommissioning node will be in 1/3 
of the replications.  Making it more distributed will be a much bigger lift.  
At the very least I can round-robin the blocks based on drives.

> Do Not Remove Blocks Sequentially During Decommission 
> --
>
> Key: HDFS-13157
> URL: https://issues.apache.org/jira/browse/HDFS-13157
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode, namenode
>Affects Versions: 3.0.0
>Reporter: David Mollitor
>Assignee: David Mollitor
>Priority: Major
>
> From what I understand of [DataNode 
> decommissioning|https://github.com/apache/hadoop/blob/42a1c98597e6dba2e371510a6b2b6b1fb94e4090/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeAdminManager.java]
>  it appears that all the blocks are scheduled for removal _in order_. I'm 
> not 100% sure what the ordering is exactly, but I think it loops through each 
> data volume and schedules each block to be replicated elsewhere. The net 
> effect is that during a decommission, all of the DataNode transfer threads 
> slam on a single volume until it is cleaned out. At which point, they all 
> slam on the next volume, etc.
> Please randomize the block list so that there is a more even distribution 
> across all volumes when decommissioning a node.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-13157) Do Not Remove Blocks Sequentially During Decommission

2019-08-30 Thread David Mollitor (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-13157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16919534#comment-16919534
 ] 

David Mollitor commented on HDFS-13157:
---

Also... One area of research...

 

I'm not sure if the Replication Commands are all sent to the one 
decommissioning DataNode or if the blocks that need to be replicated are 
examined and the load is spread across all of the DataNodes.  Even if that is 
the case, if the block replication is 3, and the nodes selected for replication 
are random, the decommissioning node will be selected 1/3 of the time for 
replication.  With that said, perhaps the decommissioning node should be 
exempted from replication altogether (for all blocks with replication >1) so 
that the load is spread randomly throughout all the DataNodes in the cluster 
and does not get concentrated on the decommissioning node.

> Do Not Remove Blocks Sequentially During Decommission 
> --
>
> Key: HDFS-13157
> URL: https://issues.apache.org/jira/browse/HDFS-13157
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode, namenode
>Affects Versions: 3.0.0
>Reporter: David Mollitor
>Assignee: David Mollitor
>Priority: Major
>
> From what I understand of [DataNode 
> decommissioning|https://github.com/apache/hadoop/blob/42a1c98597e6dba2e371510a6b2b6b1fb94e4090/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeAdminManager.java]
>  it appears that all the blocks are scheduled for removal _in order_. I'm 
> not 100% sure what the ordering is exactly, but I think it loops through each 
> data volume and schedules each block to be replicated elsewhere. The net 
> effect is that during a decommission, all of the DataNode transfer threads 
> slam on a single volume until it is cleaned out. At which point, they all 
> slam on the next volume, etc.
> Please randomize the block list so that there is a more even distribution 
> across all volumes when decommissioning a node.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-13157) Do Not Remove Blocks Sequentially During Decommission

2019-08-29 Thread David Mollitor (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-13157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16918978#comment-16918978
 ] 

David Mollitor commented on HDFS-13157:
---

Yup.  That's how I plan on improving this:

 {code:java|title=DatanodeDescriptor.java}
private void update() {
  while(index < iterators.size() - 1 && !iterators.get(index).hasNext()) {
index++;
  }
}
{code}

https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeDescriptor.java#L617-L621

I'm going to round-robin this iterator.  Each call to {{next()}} on this 
{{BlockIterator()}} will call {{next()}} cyclically on the list of available 
{{Iterator}}s until they are all exhausted instead of draining each 
{{Iterator}} in order.
 
I just need to see what else might be using this Iterator and see if the other 
places need this linear behavior.
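
As a rough sketch of that round-robin behaviour, here is a standalone class 
(again my own sketch, not the actual DatanodeDescriptor code) where each call 
to next() advances a different underlying Iterator instead of draining them one 
at a time:
{code:java}
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

public class RoundRobinIterator<T> implements Iterator<T> {
  private final List<Iterator<T>> iterators;
  private int index = 0;

  public RoundRobinIterator(List<Iterator<T>> iterators) {
    this.iterators = new ArrayList<>(iterators);
  }

  @Override
  public boolean hasNext() {
    // Drop exhausted iterators; any survivor means there is more to return.
    iterators.removeIf(it -> !it.hasNext());
    return !iterators.isEmpty();
  }

  @Override
  public T next() {
    if (!hasNext()) {
      throw new NoSuchElementException();
    }
    // hasNext() pruned exhausted iterators, so wrap the index and advance that one.
    index = index % iterators.size();
    return iterators.get(index++).next();
  }
}
{code}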

> Do Not Remove Blocks Sequentially During Decommission 
> --
>
> Key: HDFS-13157
> URL: https://issues.apache.org/jira/browse/HDFS-13157
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode, namenode
>Affects Versions: 3.0.0
>Reporter: David Mollitor
>Assignee: David Mollitor
>Priority: Major
>
> From what I understand of [DataNode 
> decommissioning|https://github.com/apache/hadoop/blob/42a1c98597e6dba2e371510a6b2b6b1fb94e4090/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeAdminManager.java]
>  it appears that all the blocks are scheduled for removal _in order_. I'm 
> not 100% sure what the ordering is exactly, but I think it loops through each 
> data volume and schedules each block to be replicated elsewhere. The net 
> effect is that during a decommission, all of the DataNode transfer threads 
> slam on a single volume until it is cleaned out. At which point, they all 
> slam on the next volume, etc.
> Please randomize the block list so that there is a more even distribution 
> across all volumes when decommissioning a node.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-13157) Do Not Remove Blocks Sequentially During Decommission

2019-08-29 Thread David Mollitor (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-13157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16918972#comment-16918972
 ] 

David Mollitor commented on HDFS-13157:
---

Yup.  It does it in order based on Storage (volume)...

 

... Process The Blocks Based on BlockIterator ...

[https://github.com/apache/hadoop/blob/42a1c98597e6dba2e371510a6b2b6b1fb94e4090/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeAdminManager.java#L628-L630]

 

... Loops Through the Storage Devices (Volumes) To Create Iterator ...

[https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeDescriptor.java#L584-L592]

 

... Reconstruct Each Block 

[https://github.com/apache/hadoop/blob/42a1c98597e6dba2e371510a6b2b6b1fb94e4090/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeAdminManager.java#L716-L719]

 

Maybe the Iterator should round-robin the Blocks instead of iterating linearly.

> Do Not Remove Blocks Sequentially During Decommission 
> --
>
> Key: HDFS-13157
> URL: https://issues.apache.org/jira/browse/HDFS-13157
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode, namenode
>Affects Versions: 3.0.0
>Reporter: David Mollitor
>Assignee: David Mollitor
>Priority: Major
>
> From what I understand of [DataNode 
> decommissioning|https://github.com/apache/hadoop/blob/42a1c98597e6dba2e371510a6b2b6b1fb94e4090/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeAdminManager.java]
>  it appears that all the blocks are scheduled for removal _in order_. I'm 
> not 100% sure what the ordering is exactly, but I think it loops through each 
> data volume and schedules each block to be replicated elsewhere. The net 
> effect is that during a decommission, all of the DataNode transfer threads 
> slam on a single volume until it is cleaned out. At which point, they all 
> slam on the next volume, etc.
> Please randomize the block list so that there is a more even distribution 
> across all volumes when decommissioning a node.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-13157) Do Not Remove Blocks Sequentially During Decommission

2019-08-29 Thread David Mollitor (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-13157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16918948#comment-16918948
 ] 

David Mollitor commented on HDFS-13157:
---

Note to self:

 

I am confusing myself.  This issue has to do with replicating blocks off the 
DataNode, not issuing deletes to the DataNode.

> Do Not Remove Blocks Sequentially During Decommission 
> --
>
> Key: HDFS-13157
> URL: https://issues.apache.org/jira/browse/HDFS-13157
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode, namenode
>Affects Versions: 3.0.0
>Reporter: David Mollitor
>Assignee: David Mollitor
>Priority: Major
>
> From what I understand of [DataNode 
> decommissioning|https://github.com/apache/hadoop/blob/42a1c98597e6dba2e371510a6b2b6b1fb94e4090/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeAdminManager.java]
>  it appears that all the blocks are scheduled for removal _in order_. I'm 
> not 100% sure what the ordering is exactly, but I think it loops through each 
> data volume and schedules each block to be replicated elsewhere. The net 
> effect is that during a decommission, all of the DataNode transfer threads 
> slam on a single volume until it is cleaned out. At which point, they all 
> slam on the next volume, etc.
> Please randomize the block list so that there is a more even distribution 
> across all volumes when decommissioning a node.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-13157) Do Not Remove Blocks Sequentially During Decommission

2019-08-29 Thread David Mollitor (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-13157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16918934#comment-16918934
 ] 

David Mollitor commented on HDFS-13157:
---

Note to self:

I see many places where the Blocks are stored in a Map of DataNode :: Set of 
Blocks.  I'm not so sure that it stores them as DataNode :: Volume :: Blocks; 
if it does not, the distribution should be pretty random, since the Set does 
not guarantee any ordering.

> Do Not Remove Blocks Sequentially During Decommission 
> --
>
> Key: HDFS-13157
> URL: https://issues.apache.org/jira/browse/HDFS-13157
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode, namenode
>Affects Versions: 3.0.0
>Reporter: David Mollitor
>Assignee: David Mollitor
>Priority: Major
>
> From what I understand of [DataNode 
> decommissioning|https://github.com/apache/hadoop/blob/42a1c98597e6dba2e371510a6b2b6b1fb94e4090/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeAdminManager.java]
>  it appears that all the blocks are scheduled for removal _in order_. I'm 
> not 100% sure what the ordering is exactly, but I think it loops through each 
> data volume and schedules each block to be replicated elsewhere. The net 
> effect is that during a decommission, all of the DataNode transfer threads 
> slam on a single volume until it is cleaned out. At which point, they all 
> slam on the next volume, etc.
> Please randomize the block list so that there is a more even distribution 
> across all volumes when decommissioning a node.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-13157) Do Not Remove Blocks Sequentially During Decommission

2019-08-29 Thread David Mollitor (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-13157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16918836#comment-16918836
 ] 

David Mollitor commented on HDFS-13157:
---

I need to validate that blocks are actually being deleted in order, but it's 
likely.

 

Note to self:

Consider doing the shuffle here:

https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeDescriptor.java#L729-L730

> Do Not Remove Blocks Sequentially During Decommission 
> --
>
> Key: HDFS-13157
> URL: https://issues.apache.org/jira/browse/HDFS-13157
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode, namenode
>Affects Versions: 3.0.0
>Reporter: David Mollitor
>Priority: Major
>
> From what I understand of [DataNode 
> decommissioning|https://github.com/apache/hadoop/blob/42a1c98597e6dba2e371510a6b2b6b1fb94e4090/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeAdminManager.java]
>  it appears that all the blocks are scheduled for removal _in order_. I'm 
> not 100% sure what the ordering is exactly, but I think it loops through each 
> data volume and schedules each block to be replicated elsewhere. The net 
> effect is that during a decommission, all of the DataNode transfer threads 
> slam on a single volume until it is cleaned out. At which point, they all 
> slam on the next volume, etc.
> Please randomize the block list so that there is a more even distribution 
> across all volumes when decommissioning a node.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-13157) Do Not Remove Blocks Sequentially During Decommission

2018-10-26 Thread Mark Sloan (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-13157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16665607#comment-16665607
 ] 

Mark Sloan commented on HDFS-13157:
---

Has anyone looked at this?

 

It seems like a random / round-robin ordering of the block scheduling for 
removal across the available data volumes would allow for significantly better 
throughput, rather than a process that takes days or weeks to complete at the 
speed of a single data volume at a time. 

 

 

> Do Not Remove Blocks Sequentially During Decommission 
> --
>
> Key: HDFS-13157
> URL: https://issues.apache.org/jira/browse/HDFS-13157
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode, namenode
>Affects Versions: 3.0.0
>Reporter: BELUGA BEHR
>Priority: Major
>
> From what I understand of [DataNode 
> decommissioning|https://github.com/apache/hadoop/blob/42a1c98597e6dba2e371510a6b2b6b1fb94e4090/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeAdminManager.java]
>  it appears that all the blocks are scheduled for removal _in order_. I'm 
> not 100% sure what the ordering is exactly, but I think it loops through each 
> data volume and schedules each block to be replicated elsewhere. The net 
> effect is that during a decommission, all of the DataNode transfer threads 
> slam on a single volume until it is cleaned out. At which point, they all 
> slam on the next volume, etc.
> Please randomize the block list so that there is a more even distribution 
> across all volumes when decommissioning a node.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org