Gopal V created HDFS-8278:
-----------------------------

             Summary: HDFS Balancer should consider remaining storage % when 
checking for under-utilized machines
                 Key: HDFS-8278
                 URL: https://issues.apache.org/jira/browse/HDFS-8278
             Project: Hadoop HDFS
          Issue Type: Bug
          Components: balancer & mover
    Affects Versions: 2.8.0
            Reporter: Gopal V


DFS balancer mistakenly identifies a node with very little storage space 
remaining as an "underutilized" node and tries to move large amounts of data to 
that particular node.

All of these block moves fail, because the % utilization matters less than 
the DFS storage remaining on that node.

{code}
15/04/24 04:25:55 INFO balancer.Balancer: 0 over-utilized: []
15/04/24 04:25:55 INFO balancer.Balancer: 1 underutilized: 
[172.19.1.46:50010:DISK]
15/04/24 04:25:55 INFO balancer.Balancer: Need to move 47.68 GB to make the 
cluster balanced.
15/04/24 04:25:55 INFO balancer.Balancer: Decided to move 413.08 MB bytes from 
172.19.1.52:50010:DISK to 172.19.1.46:50010:DISK
15/04/24 04:25:55 INFO balancer.Balancer: Will move 413.08 MB in this iteration
15/04/24 04:25:55 WARN balancer.Dispatcher: Failed to move 
blk_1078689321_1099517353638 with size=131146 from 172.19.1.52:50010:DISK to 
172.19.1.46:50010:DISK through 172.19.1.53:50010: Got error, status message 
opReplaceBlock 
BP-942051088-172.18.1.41-1370508013893:blk_1078689321_1099517353638 received 
exception org.apache.hadoop.util.DiskChecker$DiskOutOfSpaceException: Out of 
space: The volume with the most available space (=225042432 B) is less than the 
block size (=268435456 B)., block move is failed
{code}

The machine in question is under-utilized in terms of block pool (BP) 
utilization, but has very little free space available for blocks.

{code}
Decommission Status : Normal
Configured Capacity: 3826907185152 (3.48 TB)
DFS Used: 2817262833664 (2.56 TB)
Non DFS Used: 1000621305856 (931.90 GB)
DFS Remaining: 9023045632 (8.40 GB)
DFS Used%: 73.62%
DFS Remaining%: 0.24%
Configured Cache Capacity: 8589934592 (8 GB)
Cache Used: 0 (0 B)
Cache Remaining: 8589934592 (8 GB)
Cache Used%: 0.00%
Cache Remaining%: 100.00%
Xceivers: 3
Last contact: Fri Apr 24 04:28:36 PDT 2015
{code}

The node has only 8.40 GB of non-RAM (disk) storage available, and the log 
above shows that no single volume can hold a full 256 MB block, so it is 
futile to attempt to move any blocks to that particular machine.
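
The numbers in the datanode report can be cross-checked with a little 
arithmetic. The sketch below is not Balancer code: the class and method names 
are hypothetical, and it only recomputes the percentages from the byte counts 
quoted in the report and the DiskOutOfSpaceException message.

```java
// Hypothetical helper, not part of Hadoop: recomputes the figures in the
// datanode report above from its raw byte counts.
final class DatanodeReportCheck {
    static final long CAPACITY = 3826907185152L;        // Configured Capacity
    static final long DFS_USED = 2817262833664L;        // DFS Used
    static final long NON_DFS_USED = 1000621305856L;    // Non DFS Used
    static final long REMAINING = 9023045632L;          // DFS Remaining
    static final long LARGEST_VOLUME_FREE = 225042432L; // from the exception text
    static final long BLOCK_SIZE = 268435456L;          // from the exception text

    // Share of 'whole' taken by 'part', as a percentage.
    static double percent(long part, long whole) {
        return 100.0 * part / whole;
    }

    // A block move can only land if some volume has room for a full block.
    static boolean canAcceptFullBlock(long largestVolumeFree, long blockSize) {
        return largestVolumeFree >= blockSize;
    }

    public static void main(String[] args) {
        System.out.printf("DFS Used%%: %.2f%n", percent(DFS_USED, CAPACITY));
        System.out.printf("DFS Remaining%%: %.2f%n", percent(REMAINING, CAPACITY));
        System.out.println("Room for a full block: "
                + canAcceptFullBlock(LARGEST_VOLUME_FREE, BLOCK_SIZE));
    }
}
```

The node looks like a good target by "DFS Used%" alone, while the remaining 
space and the per-volume check both say it cannot take more data.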

A similar concern arises when a machine loses disks, since utilization is 
always compared as a per-node percentage. That scenario also needs the 
maximum amount of data moved to a node to be capped by its "DFS Remaining %" 
value.

Trying to move any more data than that to a given node will always fail.
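
A rough sketch of what such a cap could look like follows. It is an 
illustration under stated assumptions, not the Balancer's actual logic: the 
class and method names are hypothetical, the cluster-average utilization is an 
assumed input, and the cap is expressed in remaining bytes, which is 
equivalent to "DFS Remaining %" multiplied by the node's capacity.

```java
// Hypothetical sketch, not Hadoop code: limits how much data a node is asked
// to receive so that it never exceeds the space it has left.
final class RemainingAwareCap {
    /**
     * Bytes the node would need to receive to reach the cluster-average
     * utilization, capped by the bytes it can actually still store.
     *
     * @param avgUtilizationPercent assumed cluster-wide average DFS Used%
     */
    static long maxBytesToReceive(long capacity, long dfsUsed, long remaining,
                                  double avgUtilizationPercent) {
        double nodeUtilizationPercent = 100.0 * dfsUsed / capacity;
        double deficitPercent = avgUtilizationPercent - nodeUtilizationPercent;
        if (deficitPercent <= 0) {
            return 0L; // node is not below the average, nothing to receive
        }
        long wanted = (long) (deficitPercent * capacity / 100.0);
        // The proposed change: never plan more than the node can hold.
        return Math.min(wanted, Math.max(remaining, 0L));
    }

    public static void main(String[] args) {
        // Byte counts from the datanode report above; 85% is an assumed
        // cluster average chosen only for illustration.
        long cap = maxBytesToReceive(3826907185152L, 2817262833664L,
                                     9023045632L, 85.0);
        System.out.println("Planned bytes for this node: " + cap);
    }
}
```

With the reported figures, any plan based on utilization alone asks for far 
more than the node's 8.40 GB of remaining space, so the cap is what bounds the 
move.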



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
