Hi Bryan,

It is a bug that we calculate the average including marked-for-decomm. DNs. Please do log a JIRA for this!
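For illustration, here is a quick standalone sketch of the difference the denominator makes, using your numbers. This is not the actual BlockPlacementPolicyDefault code; the Node class and its fields are made up for the example:

import java.util.ArrayList;
import java.util.List;

public class LoadAverageSketch {
  // Illustrative stand-in for a DataNode's state; not the real HDFS classes.
  static class Node {
    final int xceiverCount;       // current load on the node
    final boolean decommissioned; // decommission has completed

    Node(int xceiverCount, boolean decommissioned) {
      this.xceiverCount = xceiverCount;
      this.decommissioned = decommissioned;
    }
  }

  public static void main(String[] args) {
    List<Node> nodes = new ArrayList<>();
    // 22 live nodes sharing a total load of 250 (8 nodes at 12, 14 nodes at 11)
    for (int i = 0; i < 22; i++) {
      nodes.add(new Node(i < 8 ? 12 : 11, false));
    }
    // 70 decommissioned nodes with zero load, still present in the node list
    for (int i = 0; i < 70; i++) {
      nodes.add(new Node(0, true));
    }

    int totalLoad = nodes.stream().mapToInt(n -> n.xceiverCount).sum();     // 250
    long liveNodes = nodes.stream().filter(n -> !n.decommissioned).count(); // 22

    double avgAllNodes  = (double) totalLoad / nodes.size(); // 250 / 92 = ~2.72 (buggy)
    double avgLiveNodes = (double) totalLoad / liveNodes;    // 250 / 22 = ~11.36

    System.out.printf("avg over all nodes:  %.2f%n", avgAllNodes);
    System.out.printf("avg over live nodes: %.2f%n", avgLiveNodes);
    // considerLoad rejects a node whose load exceeds 2 * avg, so with the buggy
    // average (~2.72) every live node (load ~11) gets excluded from placement.
  }
}

The fix in the placement policy would essentially be the second calculation: leave decommissioned nodes out of the denominator when computing the cluster's average load.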
On Fri, Jan 17, 2014 at 4:07 AM, Bryan Beaudreault <bbeaudrea...@hubspot.com> wrote:

> Running the CDH4.2 version of HDFS, I may have found a bug in the
> dfs.namenode.replication.considerLoad feature. I would like to query here
> before entering a JIRA (a quick search was fruitless).
>
> I have an HBase cluster in which I recently replaced 70 smaller servers with
> 22 larger ones. I added the 22 to the cluster, moved all of the HBase
> regions to the new servers, major compacted to re-write locally, then used
> the HDFS decommission to decommission the 65 smaller servers.
>
> This all worked well, and HBase was happy.
>
> However, later on, after the decommission finished, I tried to write a file
> to HDFS from a node that does NOT have a DataNode running on it (HMaster).
> These operations failed because all 92 servers were being set to excluded.
> See https://gist.github.com/bbeaudreault/49c8aa4bb231de54e9c1 for logs.
>
> Reading through the code, I found that the DefaultBlockPlacementPolicy
> calculates the load average of the cluster as TotalClusterLoad /
> numNodes. However, numNodes includes decommissioned nodes (which have 0
> load). Therefore, the average load is artificially low. Example:
>
> TotalLoad = 250
> numNodes = 92
> decommissionedNodes = 70
>
> avgLoad = 250 / 92 = 2.71
> trueAvgLoad = 250 / (92 - 70) = 11.36
>
> Because of this math, all of our remaining 22 nodes were considered
> "overloaded", as they were all more than 2x 2.71. That, with the
> decommissioned nodes already excluded, results in all servers being
> excluded.
>
> (Looking at the logs of my regionservers later, I did see that a bunch of
> writes were not able to reach their required replication factor as well,
> though they did not fail this spectacularly.)
>
> Is this a bug or expected behavior?

--
Harsh J