[ https://issues.apache.org/jira/browse/HDFS-13739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hari Sekhon updated HDFS-13739:
-------------------------------
    Description: 
Current HDFS write pattern of "local node, rack local node, other rack node" is 
good for most purposes but there are at least 2 scenarios where this is not 
ideal:
 # Rack-by-rack upgrades pose a risk of losing the last remaining replica. 
While a rack is down, any block with two of its three replicas on that rack has 
only one replica left, so a single datanode failure would likely cause a data 
outage, or even data loss if the rack is lost or the upgrade fails (perhaps 
it's a complete rebuild upgrade). Currently the only workaround is setting 
replicas to 4, which would reduce write performance and waste storage.
 # When there is an uneven layout of datanodes across racks it can cause major 
storage imbalance across nodes, with some nodes filling up and others being 
half empty.
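
The rack-upgrade risk in the first scenario can be sketched as a count of 
surviving replicas. This is an illustrative model only, not HDFS code: the 
function name and the rack labels are hypothetical, and the placement of a 4th 
replica on a third rack is an assumption.

```python
def live_replicas(replica_racks, down_rack):
    """Count the replicas of one block that survive while down_rack is offline.

    replica_racks lists the rack holding each replica of the block
    (hypothetical labels, one entry per replica).
    """
    return sum(1 for rack in replica_racks if rack != down_rack)


# "local node, rack local node, other rack node": two replicas share rack1.
three_replicas = ["rack1", "rack1", "rack2"]

# Assumed workaround layout: replication 4 with the extra replica on a third rack.
four_replicas = ["rack1", "rack1", "rack2", "rack3"]

# Replicas remaining while rack1 is taken down for its upgrade.
during_upgrade_3 = live_replicas(three_replicas, "rack1")
during_upgrade_4 = live_replicas(four_replicas, "rack1")
```

With replication 3 the block is down to a single copy for the duration of the 
rack's upgrade, while the assumed replication-4 layout still has two.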

I have observed this storage imbalance on a cluster where half the nodes were 
85% full and the other half were only 50% full.

Rack layouts like the following illustrate this - the nodes in the same rack 
will only choose to send half their block replicas to each other, so they will 
fill up first, while other nodes will receive far fewer replica blocks:
{code:java}
NumNodes - Rack 
2 - rack 1
2 - rack 2
1 - rack 3
1 - rack 4 
1 - rack 5
1 - rack 6{code}
In this case if I reduce the number of replicas to 2 then I get an almost 
perfect spread of blocks across all datanodes, because HDFS has no choice but 
to place the second (and only other) replica on a different rack. If I increase 
the replicas back to 3 it goes back to 85% on half the nodes and 50% on the 
other half, because the extra replica is written only to rack local nodes.
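
The effect of this layout can be illustrated with a small expected-value 
model. It is a sketch under stated assumptions, not the real HDFS block 
placement code: every datanode writes the same number of blocks, placement 
follows the "local node, rack local node, other rack node" pattern described 
above, a writer with no rack peer sends that replica to another rack instead, 
and all choices are uniform. The function and variable names are hypothetical.

```python
from fractions import Fraction

# Rack layout from the report: rack name -> number of datanodes.
LAYOUT = {"rack1": 2, "rack2": 2, "rack3": 1, "rack4": 1, "rack5": 1, "rack6": 1}


def expected_load(layout, replicas):
    """Expected replicas stored on each node when every node writes one block."""
    nodes = [(rack, i) for rack, count in layout.items() for i in range(count)]
    load = {node: Fraction(0) for node in nodes}
    for writer in nodes:
        load[writer] += 1  # first replica stays on the writer's own node
        remaining = replicas - 1
        peers = [n for n in nodes if n[0] == writer[0] and n != writer]
        off_rack = [n for n in nodes if n[0] != writer[0]]
        if replicas >= 3 and peers:
            # Assumption: one replica goes to a rack-local peer, chosen uniformly.
            for peer in peers:
                load[peer] += Fraction(1, len(peers))
            remaining -= 1
        # Assumption: the remaining replicas go to distinct off-rack nodes,
        # chosen uniformly, so each off-rack node receives remaining/len each.
        for node in off_rack:
            load[node] += Fraction(remaining, len(off_rack))
    return load


load3 = expected_load(LAYOUT, 3)
load2 = expected_load(LAYOUT, 2)
```

Under these assumptions a node in a two-node rack holds 73/21 (about 3.48) 
replicas per round of writes with replication 3, against 53/21 (about 2.52) 
for a node alone in its rack. With replication 2 the figures are 40/21 and 
44/21, which is close to even. The model only shows the direction of the skew; 
it does not attempt to reproduce the 85% and 50% figures observed on the real 
cluster.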

Why not just run the HDFS balancer to fix it, you might ask? This is a heavily 
loaded HBase cluster. Aside from the balancer destroying HBase's data locality 
and performance by moving blocks out from underneath RegionServers, as soon as 
an HBase major compaction occurs (at least weekly), all blocks get re-written 
by HBase and the HDFS client again writes to local node, rack local node, 
other rack node, resulting in the same storage imbalance. Hence this cannot be 
solved by running the HDFS balancer on HBase clusters, or for any application 
sitting on top of HDFS that has any HDFS block churn.

  was:
Current HDFS write pattern of "local node, rack local node, other rack node" is 
good for most purposes but when there is an uneven layout of datanodes across 
racks it can cause major storage imbalance across nodes with some nodes filling 
up and others being half empty.

I have observed this on a cluster where half the nodes were 85% full and the 
other half were only 50% full.

Rack layouts like the following illustrate this - the nodes in the same rack 
will only choose to send half their block replicas to each other, so they will 
fill up first, while other nodes will receive far fewer replica blocks:
{code:java}
NumNodes - Rack 
2 - rack 1
2 - rack 2
1 - rack 3
1 - rack 4 
1 - rack 5
1 - rack 6{code}
In this case if I reduce the number of replicas to 2 then I get an almost 
perfect spread of blocks across all datanodes, because HDFS has no choice but 
to place the second (and only other) replica on a different rack. If I increase 
the replicas back to 3 it goes back to 85% on half the nodes and 50% on the 
other half, because the extra replica is written only to rack local nodes.


Why not just run the HDFS balancer to fix it, you might ask? This is a heavily 
loaded HBase cluster. Aside from the balancer destroying HBase's data locality 
and performance by moving blocks out from underneath RegionServers, as soon as 
an HBase major compaction occurs (at least weekly), all blocks get re-written 
by HBase and the HDFS client again writes to local node, rack local node, 
other rack node, resulting in the same storage imbalance. Hence this cannot be 
solved by running the HDFS balancer on HBase clusters, or for any application 
sitting on top of HDFS that has any HDFS block churn.


> Option to disable Rack Local Write Preference to avoid 2 issues - Whole Rack 
> Maintenance without risk of only 1 remaining replica, and avoid Major Storage 
> Imbalance across DataNodes caused by uneven spread of Datanodes across Racks
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-13739
>                 URL: https://issues.apache.org/jira/browse/HDFS-13739
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: balancer & mover, block placement, datanode, fs, 
> hdfs, hdfs-client, namenode, nn, performance
>    Affects Versions: 2.7.3
>         Environment: Hortonworks HDP 2.6
>            Reporter: Hari Sekhon
>            Priority: Major



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
