[ https://issues.apache.org/jira/browse/HDFS-14703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17387540#comment-17387540 ]
Daryn Sharp commented on HDFS-14703:
------------------------------------

I applaud this effort but I have concerns about real-world scalability. This may positively affect a synthetic benchmark, small clusters with a small namespace, or a presumed deterministic data creation rate and locality, but... How will this scale to namespaces of up to almost 1 billion total objects with 250-500 million inodes, performing an average of 20-40k ops/sec with occasional spikes exceeding 100k ops/sec, with up to 600 active applications?

{quote}entries of the same directory are partitioned into the same range most of the time. “Most of the time” means that very large directories containing too many children or some of boundary directories can still span across multiple partitions.
{quote}

How many is “too many children”? It’s not uncommon for directories to contain thousands or even tens or hundreds of thousands of files. Jobs with large DAGs that run for minutes, hours, or days seem likely to violate “most of the time” and create high fragmentation across partitions. Task time shift caused by queue resource availability, speculation, preemption, etc. will violate the premise that inodes neatly cluster into partitions based on creation time.

How will this handle things like IBR processing, which touches many blocks spread across multiple partitions? Especially during a replication flurry caused by rack loss or multi-rack decommissioning (over a hundred hosts)?

How will live-lock conditions be resolved when multiple ops need to lock multiple overlapping partitions? Managing that dependency graph might wipe out real-world improvements at scale.

{quote}An analysis shows that the two locks are held simultaneously most of the time, making one of them redundant. We suggest removing the FSDirectoryLock.
{quote}

Please do not remove the fsdir lock.
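On the overlapping-partitions concern: one standard way to avoid deadlock when a single operation must lock several partitions is to always acquire the per-partition locks in one fixed global order (e.g. ascending partition index) and release in reverse. This is a minimal illustrative sketch, not code from the attached POC patches; the class and method names are hypothetical:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Hypothetical sketch: per-partition locks acquired in a fixed global
 *  order so that no cycle of ops waiting on each other can form. */
public class PartitionLocks {
    private final ReentrantReadWriteLock[] locks;
    // Recorded purely to illustrate the ordering in main().
    final List<Integer> acquisitionOrder = new ArrayList<>();

    PartitionLocks(int partitions) {
        locks = new ReentrantReadWriteLock[partitions];
        for (int i = 0; i < partitions; i++) {
            locks[i] = new ReentrantReadWriteLock();
        }
    }

    /** Write-lock several partitions in ascending index order, regardless
     *  of the order the caller lists them in. */
    void writeLock(int... partitions) {
        int[] sorted = partitions.clone();
        Arrays.sort(sorted);
        for (int p : sorted) {
            locks[p].writeLock().lock();
            acquisitionOrder.add(p);
        }
    }

    /** Release in reverse order of acquisition. */
    void writeUnlock(int... partitions) {
        int[] sorted = partitions.clone();
        Arrays.sort(sorted);
        for (int i = sorted.length - 1; i >= 0; i--) {
            locks[sorted[i]].writeLock().unlock();
        }
    }

    public static void main(String[] args) {
        PartitionLocks pl = new PartitionLocks(8);
        pl.writeLock(5, 2, 7);   // acquired as 2, 5, 7
        pl.writeUnlock(5, 2, 7);
        System.out.println(pl.acquisitionOrder); // prints [2, 5, 7]
    }
}
```

Note that ordered acquisition only prevents deadlock; it does nothing for the throughput cost Daryn raises, since an op touching k partitions still serializes against every other op on any of those k partitions.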
While the fsn and fsd locks are generally redundant, we have internal changes for operations like the horribly expensive content summary to not hold the fsn lock after resolving the path. I’ve been investigating whether some other operations can safely release the fsn lock, or downgrade from an fsn write lock to a read lock, after acquiring the fsd lock.

{quote}Particularly, this means that each BlocksMap partition contains all blocks of files in the corresponding INodeMap partition.
{quote}

If the namespace is “unexpectedly” fragmented across multiple partitions per the above, what further effect will this have on data skew (blocks per file) in the partition? Users generate an assortment of relatively small files plus multi-GB or TB files. A directory tree may contain dirs holding any mixture of minutes/hourly/daily/weekly/monthly rollup data. This sort of partitioning seems likely to produce even more lopsided partitioning within the blocks map.

{quote}We ran NNThroughputBenchmark for mkdir() operation creating 10 million directories with 200 concurrent threads. The results show 1. 30-40% improvement in throughput compared to current INodeMap implementation
{quote}

Can you please provide more context?
# Was the entry point for the calls the RPC server, fsn, fsdir, etc.? Relevant since end-to-end benchmarking rarely matches microbenchmarks.
# What is a “30-40%” improvement in absolute terms? How many ops/sec before and after?
# Did the threads create dirs in a partition-friendly order, i.e. sequential creation under the same dir trees?
# What impact did it have on GC frequency and GC time? These are often hidden killers of performance when not taken into consideration.
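The write-to-read downgrade mentioned above maps directly onto `java.util.concurrent.locks.ReentrantReadWriteLock`, which permits acquiring the read lock while still holding the write lock. A minimal sketch of the pattern, illustrative only and not actual NameNode code (the operation and its return value are hypothetical):

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Sketch of the fsn write-to-read lock downgrade pattern; the
 *  "expensive read-only op" stands in for something like content summary. */
public class LockDowngradeDemo {
    private final ReentrantReadWriteLock fsnLock = new ReentrantReadWriteLock();

    int expensiveReadOnlyOp() {
        fsnLock.writeLock().lock();
        try {
            // ... resolve the path (and any mutation) under the write lock ...
            // Downgrade: take the read lock BEFORE releasing the write lock,
            // so no other writer can slip in between the two steps.
            fsnLock.readLock().lock();
        } finally {
            fsnLock.writeLock().unlock();
        }
        try {
            // Long-running read-only work now proceeds under the read lock,
            // allowing concurrent readers instead of blocking the namesystem.
            return 42; // hypothetical result
        } finally {
            fsnLock.readLock().unlock();
        }
    }

    public static void main(String[] args) {
        System.out.println(new LockDowngradeDemo().expensiveReadOnlyOp()); // prints 42
    }
}
```

The ordering matters: releasing the write lock before taking the read lock would open a window in which another writer could mutate the just-resolved state.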
> NameNode Fine-Grained Locking via Metadata Partitioning
> -------------------------------------------------------
>
>                 Key: HDFS-14703
>                 URL: https://issues.apache.org/jira/browse/HDFS-14703
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: hdfs, namenode
>            Reporter: Konstantin Shvachko
>            Priority: Major
>         Attachments: 001-partitioned-inodeMap-POC.tar.gz, 002-partitioned-inodeMap-POC.tar.gz, 003-partitioned-inodeMap-POC.tar.gz, NameNode Fine-Grained Locking.pdf, NameNode Fine-Grained Locking.pdf
>
> We target to enable fine-grained locking by splitting the in-memory namespace into multiple partitions each having a separate lock. Intended to improve performance of NameNode write operations.

--
This message was sent by Atlassian Jira (v8.3.4#803005)