[jira] [Commented] (HDFS-8278) HDFS Balancer should consider remaining storage % when checking for under-utilized machines

Tsz Wo Nicholas Sze (JIRA) Mon, 17 Aug 2015 11:00:11 -0700

    [ https://issues.apache.org/jira/browse/HDFS-8278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14699940#comment-14699940 ]


Tsz Wo Nicholas Sze commented on HDFS-8278:
-------------------------------------------

The Balancer does consider remaining storage; it is used to compute 
max-size-to-move.  The problem here is that the datanode throws 
DiskOutOfSpaceException when no volume has space for a full block.  In the 
description, the block being moved is only 131146 bytes (~= 128k), but the 
default block size is 268435456 bytes (= 256M).
{code}
15/04/24 04:25:55 WARN balancer.Dispatcher: Failed to move 
blk_1078689321_1099517353638 with size=131146 from 172.19.1.52:50010:DISK to 
172.19.1.46:50010:DISK through 172.19.1.53:50010: Got error, status message 
opReplaceBlock 
BP-942051088-172.18.1.41-1370508013893:blk_1078689321_1099517353638 received 
exception org.apache.hadoop.util.DiskChecker$DiskOutOfSpaceException: Out of 
space: The volume with the most available space (=225042432 B) is less than the 
block size (=268435456 B)., block move is failed
{code}
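
The check implied by the error message above can be sketched as follows. This 
is a minimal illustration, not the datanode's actual code: the class and method 
names are hypothetical, and the only behavior taken from the log is that the 
largest available volume is compared against the full block size rather than 
the size of the replica being moved.

```java
import java.util.Arrays;
import java.util.List;

public class FullBlockSpaceCheck {

    /**
     * Hypothetical helper: returns true only if the volume with the most
     * available space can hold requiredBytes.
     */
    public static boolean canAccept(List<Long> volumeAvailableBytes, long requiredBytes) {
        long maxAvailable = 0L;
        for (long available : volumeAvailableBytes) {
            maxAvailable = Math.max(maxAvailable, available);
        }
        return maxAvailable >= requiredBytes;
    }

    public static void main(String[] args) {
        // Values copied from the log message above.
        long mostAvailable = 225042432L;   // volume with the most available space
        long blockSize = 268435456L;       // configured block size (256M)
        long replicaSize = 131146L;        // actual size of the block being moved

        List<Long> volumes = Arrays.asList(mostAvailable);

        // Checked against the full block size, the move is rejected.
        System.out.println(canAccept(volumes, blockSize));    // prints "false"
        // Checked against the actual replica size, it would fit.
        System.out.println(canAccept(volumes, replicaSize));  // prints "true"
    }
}
```

The two calls show the gap: the replica would fit on the volume, but the move 
is rejected because space for a full block is required.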

> HDFS Balancer should consider remaining storage % when checking for 
> under-utilized machines
> -------------------------------------------------------------------------------------------
>
>                 Key: HDFS-8278
>                 URL: https://issues.apache.org/jira/browse/HDFS-8278
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: balancer & mover
>    Affects Versions: 2.8.0
>            Reporter: Gopal V
>            Assignee: Tsz Wo Nicholas Sze
>
> The DFS balancer mistakenly identifies a node with very little storage space 
> remaining as an "underutilized" node and tries to move large amounts of data 
> to that particular node.
> All of these block moves fail, because the % utilization is less relevant 
> than the DFS remaining storage on that node.
> {code}
> 15/04/24 04:25:55 INFO balancer.Balancer: 0 over-utilized: []
> 15/04/24 04:25:55 INFO balancer.Balancer: 1 underutilized: 
> [172.19.1.46:50010:DISK]
> 15/04/24 04:25:55 INFO balancer.Balancer: Need to move 47.68 GB to make the 
> cluster balanced.
> 15/04/24 04:25:55 INFO balancer.Balancer: Decided to move 413.08 MB bytes 
> from 172.19.1.52:50010:DISK to 172.19.1.46:50010:DISK
> 15/04/24 04:25:55 INFO balancer.Balancer: Will move 413.08 MB in this 
> iteration
> 15/04/24 04:25:55 WARN balancer.Dispatcher: Failed to move 
> blk_1078689321_1099517353638 with size=131146 from 172.19.1.52:50010:DISK to 
> 172.19.1.46:50010:DISK through 172.19.1.53:50010: Got error, status message 
> opReplaceBlock 
> BP-942051088-172.18.1.41-1370508013893:blk_1078689321_1099517353638 received 
> exception org.apache.hadoop.util.DiskChecker$DiskOutOfSpaceException: Out of 
> space: The volume with the most available space (=225042432 B) is less than 
> the block size (=268435456 B)., block move is failed
> {code}
> The machine in question is under-full in terms of BP (block pool) 
> utilization, but has very little free space available for blocks.
> {code}
> Decommission Status : Normal
> Configured Capacity: 3826907185152 (3.48 TB)
> DFS Used: 2817262833664 (2.56 TB)
> Non DFS Used: 1000621305856 (931.90 GB)
> DFS Remaining: 9023045632 (8.40 GB)
> DFS Used%: 73.62%
> DFS Remaining%: 0.24%
> Configured Cache Capacity: 8589934592 (8 GB)
> Cache Used: 0 (0 B)
> Cache Remaining: 8589934592 (8 GB)
> Cache Used%: 0.00%
> Cache Remaining%: 100.00%
> Xceivers: 3
> Last contact: Fri Apr 24 04:28:36 PDT 2015
> {code}
> The node has only 8.40 GB of non-RAM storage available, so it is futile to 
> attempt to move any blocks to that particular machine.
> A similar concern arises when a machine loses disks, since utilization is 
> always compared as a per-node percentage. That scenario also needs to cap 
> data movement to that node at the "DFS Remaining %" value.
> Trying to move any more data than that to a given node will always fail.
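
The arithmetic behind the datanode report in the description can be checked 
with a short sketch. The class and method names are hypothetical, and the cap 
shown is only an illustration of what the description asks for, not Balancer 
code; the 47.68 GB figure is converted to bytes approximately.

```java
public class RemainingCap {

    /** Percentage of total represented by part; 0 when total is 0. */
    public static double percent(long part, long total) {
        return total == 0 ? 0.0 : 100.0 * part / total;
    }

    /** Hypothetical cap: never plan to move more than the target has remaining. */
    public static long cappedBytesToMove(long wantedBytes, long dfsRemainingBytes) {
        return Math.max(0L, Math.min(wantedBytes, dfsRemainingBytes));
    }

    public static void main(String[] args) {
        // Values copied from the datanode report in the description.
        long capacity = 3826907185152L;
        long dfsUsed = 2817262833664L;
        long dfsRemaining = 9023045632L;

        // Approximate byte count for the "Need to move 47.68 GB" log line.
        long wanted = (long) (47.68 * (1L << 30));

        System.out.printf("DFS Used%%: %.2f%n", percent(dfsUsed, capacity));
        System.out.printf("DFS Remaining%%: %.2f%n", percent(dfsRemaining, capacity));
        System.out.println("Capped bytes to move: "
            + cappedBytesToMove(wanted, dfsRemaining));
    }
}
```

Even with such a cap, the node-level remaining figure is summed over volumes, 
so an individual move can still fail when no single volume has room for a full 
block, as in the log above.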



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
