Hi community,

I am currently drafting an HDFS scalability guideline doc, and I'd like to
understand any data points regarding HDFS scalability limits. I'd like to
share it publicly eventually.

As an example, through my workplace and through community chatter, I am
aware that HDFS is capable of operating at the following scale:

Number of DataNodes:
Unfederated: I can reasonably believe a single HDFS NameNode can manage up
to 4000 DataNodes. Is there anyone who would like to share an even larger
cluster?
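
For anyone willing to share numbers, a quick way to pull live/dead
DataNode counts is the standard admin report (assuming admin access; the
exact wording of the report varies a bit across releases):

  hdfs dfsadmin -report | grep -E 'Live datanodes|Dead datanodes'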

Federated: I am aware of one federated HDFS cluster composed of 20,000
DataNodes. JD.com
<https://conferences.oreilly.com/strata/strata-eu-2018/public/schedule/detail/64692>
has a 15K-DN cluster with 210PB of total capacity. I suspect it's a
federated HDFS cluster.
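
For context, by "federated" I mean multiple independent namespaces sharing
one pool of DataNodes. A minimal hdfs-site.xml sketch of two nameservices
(the ns1/ns2 IDs and hostnames are made up for illustration):

  <!-- two hypothetical nameservices sharing the same DataNode pool -->
  <property>
    <name>dfs.nameservices</name>
    <value>ns1,ns2</value>
  </property>
  <property>
    <name>dfs.namenode.rpc-address.ns1</name>
    <value>nn1.example.com:8020</value>
  </property>
  <property>
    <name>dfs.namenode.rpc-address.ns2</name>
    <value>nn2.example.com:8020</value>
  </property>

DataNodes configured with both nameservices register with both NameNodes,
which is how the namespace load gets spread.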

Number of blocks & files:
500 million files & blocks seems to be the upper limit at this point. At
this scale the NameNode consumes around 200GB of heap, and in my experience
anything beyond 200GB is unstable. That said, I recall someone mentioning a
400GB NN heap at some point.
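
For reference, heap sizes like these are set in hadoop-env.sh. A minimal
sketch of a large-heap NameNode setting (the variable name is the Hadoop
3.x one; 2.x uses HADOOP_NAMENODE_OPTS, and the GC flag is my assumption,
not a recommendation):

  # hadoop-env.sh: pin min and max heap so the NN doesn't resize under load
  export HDFS_NAMENODE_OPTS="-Xms200g -Xmx200g -XX:+UseG1GC"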

Amount of Data:
I am aware of a few clusters more than 100PB in size (federated or not) --
Uber, Yahoo Japan, JD.com.

Number of Volumes in a DataNode:
DataNodes with 24 volumes are known to work reasonably well. If a DataNode
is used for archival use cases, it can have up to 48 volumes. This is
certainly hardware dependent, but if I know where the current limit is, I
can start optimizing the software.
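
As a concrete (hypothetical) example of what a many-volume DataNode looks
like in configuration, each mount point is listed in dfs.datanode.data.dir,
optionally tagged with a storage type such as [DISK] or [ARCHIVE]:

  <!-- hdfs-site.xml: one entry per mounted disk; paths are made up,
       and the list continues for each of the 24 or 48 volumes -->
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>[DISK]/data/01/dfs/dn,[DISK]/data/02/dfs/dn,[ARCHIVE]/data/03/dfs/dn</value>
  </property>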

Total disk space:
CDH
<https://www.cloudera.com/documentation/enterprise/release-notes/topics/hardware_requirements_guide.html#concept_fzz_dq4_gbb>
recommends no more than 100TB per DataNode. Are there successful
deployments that go beyond this number? Of course, you can easily exceed it
if the DataNode is used purely for data archival.
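
For a rough sanity check on how the volume counts and the per-node limit
relate: 24 volumes of 4TB drives is about 96TB raw, right at the 100TB
guidance, while 48 volumes of 8TB drives would be roughly 384TB. The drive
sizes here are my own assumptions, not figures from any deployment above.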


What other scalability limits are people interested in?

Best,
Wei-Chiu
