[ https://issues.apache.org/jira/browse/HDFS-14703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17387540#comment-17387540 ]
Daryn Sharp commented on HDFS-14703:
------------------------------------

I applaud this effort but I have concerns about real-world scalability. This may positively affect a synthetic benchmark, small clusters with a small namespace, or a presumed deterministic data creation rate and locality, but... How will this scale to namespaces of up to almost 1 billion total objects with 250-500 million inodes, performing an average of 20-40k ops/sec with occasional spikes exceeding 100k ops/sec, with up to 600 active applications?

{quote}entries of the same directory are partitioned into the same range most of the time. “Most of the time” means that very large directories containing too many children or some of boundary directories can still span across multiple partitions.
{quote}

How many is “too many children”? It’s not uncommon for directories to contain thousands or even tens or hundreds of thousands of files. Jobs with large DAGs that run for minutes, hours, or days seem likely to violate “most of the time” and create high fragmentation across partitions. Task time shift caused by queue resource availability, speculation, preemption, etc. will violate the premise that inodes neatly cluster into partitions based on creation time.

How will this handle things like IBR processing, which touches many blocks spread across multiple partitions? Especially during a replication flurry caused by rack loss or multi-rack decommissioning (over a hundred hosts)?

How will live-lock conditions be resolved when multiple ops need to lock multiple overlapping partitions? Managing that dependency graph might wipe out real-world improvements at scale.

{quote}An analysis shows that the two locks are held simultaneously most of the time, making one of them redundant. We suggest removing the FSDirectoryLock.
{quote}

Please do not remove the fsdir lock.
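On the overlapping-partitions concern: one standard way to avoid deadlock when a single operation must lock several partitions is to always acquire the per-partition locks in one fixed global order (e.g. ascending partition index) and release in reverse. This is a minimal illustrative sketch, not code from the attached POC patches; the class and method names are hypothetical:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Hypothetical sketch: per-partition locks acquired in a fixed global
 *  order so that no cycle of ops waiting on each other can form. */
public class PartitionLocks {
    private final ReentrantReadWriteLock[] locks;
    // Recorded purely to illustrate the ordering in main().
    final List<Integer> acquisitionOrder = new ArrayList<>();

    PartitionLocks(int partitions) {
        locks = new ReentrantReadWriteLock[partitions];
        for (int i = 0; i < partitions; i++) {
            locks[i] = new ReentrantReadWriteLock();
        }
    }

    /** Write-lock several partitions in ascending index order, regardless
     *  of the order the caller lists them in. */
    void writeLock(int... partitions) {
        int[] sorted = partitions.clone();
        Arrays.sort(sorted);
        for (int p : sorted) {
            locks[p].writeLock().lock();
            acquisitionOrder.add(p);
        }
    }

    /** Release in reverse order of acquisition. */
    void writeUnlock(int... partitions) {
        int[] sorted = partitions.clone();
        Arrays.sort(sorted);
        for (int i = sorted.length - 1; i >= 0; i--) {
            locks[sorted[i]].writeLock().unlock();
        }
    }

    public static void main(String[] args) {
        PartitionLocks pl = new PartitionLocks(8);
        pl.writeLock(5, 2, 7);   // acquired as 2, 5, 7
        pl.writeUnlock(5, 2, 7);
        System.out.println(pl.acquisitionOrder); // prints [2, 5, 7]
    }
}
```

Note that ordered acquisition only prevents deadlock; it does nothing for the throughput cost Daryn raises, since an op touching k partitions still serializes against every other op on any of those k partitions.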
While the fsn and fsd locks are generally redundant, we have internal changes for operations like the horribly expensive content summary to not hold the fsn lock after resolving the path. I’ve been investigating whether some other operations can safely release the fsn lock, or downgrade from an fsn write lock to a read lock, after acquiring the fsd lock.

{quote}Particularly, this means that each BlocksMap partition contains all blocks of files in the corresponding INodeMap partition.
{quote}

If the namespace is “unexpectedly” fragmented across multiple partitions per the above, what further effect will this have on data skew (blocks per file) in the partition? Users generate an assortment of relatively small files plus multi-GB or TB files. A directory tree may contain dirs holding any mixture of minutes/hourly/daily/weekly/monthly rollup data. This sort of partitioning seems likely to produce even more lopsided partitioning within the blocks map.

{quote}We ran NNThroughputBenchmark for mkdir() operation creating 10 million directories with 200 concurrent threads. The results show 1. 30-40% improvement in throughput compared to current INodeMap implementation
{quote}

Can you please provide more context?
# Was the entry point for the calls the RPC server, fsn, fsdir, etc.? Relevant since end-to-end benchmarking rarely matches microbenchmarks.
# What is a “30-40%” improvement in absolute terms? How many ops/sec before and after?
# Did the threads create dirs in a partition-friendly order, i.e. sequential creation under the same dir trees?
# What impact did it have on GC frequency and GC time? These are often hidden killers of performance when not taken into consideration.
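The write-to-read downgrade mentioned above maps directly onto `java.util.concurrent.locks.ReentrantReadWriteLock`, which permits acquiring the read lock while still holding the write lock. A minimal sketch of the pattern, illustrative only and not actual NameNode code (the operation and its return value are hypothetical):

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Sketch of the fsn write-to-read lock downgrade pattern; the
 *  "expensive read-only op" stands in for something like content summary. */
public class LockDowngradeDemo {
    private final ReentrantReadWriteLock fsnLock = new ReentrantReadWriteLock();

    int expensiveReadOnlyOp() {
        fsnLock.writeLock().lock();
        try {
            // ... resolve the path (and any mutation) under the write lock ...
            // Downgrade: take the read lock BEFORE releasing the write lock,
            // so no other writer can slip in between the two steps.
            fsnLock.readLock().lock();
        } finally {
            fsnLock.writeLock().unlock();
        }
        try {
            // Long-running read-only work now proceeds under the read lock,
            // allowing concurrent readers instead of blocking the namesystem.
            return 42; // hypothetical result
        } finally {
            fsnLock.readLock().unlock();
        }
    }

    public static void main(String[] args) {
        System.out.println(new LockDowngradeDemo().expensiveReadOnlyOp()); // prints 42
    }
}
```

The ordering matters: releasing the write lock before taking the read lock would open a window in which another writer could mutate the just-resolved state.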
> NameNode Fine-Grained Locking via Metadata Partitioning
> -------------------------------------------------------
>
>                 Key: HDFS-14703
>                 URL: https://issues.apache.org/jira/browse/HDFS-14703
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: hdfs, namenode
>            Reporter: Konstantin Shvachko
>            Priority: Major
>         Attachments: 001-partitioned-inodeMap-POC.tar.gz, 002-partitioned-inodeMap-POC.tar.gz, 003-partitioned-inodeMap-POC.tar.gz, NameNode Fine-Grained Locking.pdf, NameNode Fine-Grained Locking.pdf
>
> We target to enable fine-grained locking by splitting the in-memory namespace into multiple partitions each having a separate lock. Intended to improve performance of NameNode write operations.

--
This message was sent by Atlassian Jira (v8.3.4#803005)