[ https://issues.apache.org/jira/browse/HDFS-14703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16903374#comment-16903374 ]

Konstantin Shvachko edited comment on HDFS-14703 at 8/8/19 10:10 PM:
---------------------------------------------------------------------

Hi [~hexiaoqiao], thanks for reviewing the doc. Very good questions:
# "Cousins" means files like {{/a/b/c/d}} and {{/a/b/m/n}}. They will have 
keys, respectively, {{<idb, idc, idd>}} and {{<idb, idm, idn>}}, which have 
common prefix {{<idb>}} and therefore are likely to fall into the same 
RangeGSet. In your example {{<ida, idb, idc>}} is the parent of {{<idb, idc, 
idd>}} and this key definition does not guarantee them to be in the same range.
# Deleting a directory {{/a/b/c}} means deleting the entire sub-tree underneath 
this directory. We should lock all RangeGSets involved in such a deletion, in 
particular the one containing file {{f}}, so {{f}} cannot be modified 
concurrently with the delete.
# Just to clarify: RangeMap is the upper-level part of PartitionedGSet, which 
maps key ranges to RangeGSets, so there is only one RangeMap and many 
RangeGSets (see the second sketch after this list). Holding a lock on the 
RangeMap is akin to holding a global lock. You make a good point that some 
operations, like failover, large deletes, renames, and quota changes, will 
still require a global lock; the lock on the RangeMap could play the role of 
such a global lock. This should be defined in more detail within the design of 
LatchLock. Ideally we should retain FSNamesystemLock as a global lock for some 
operations. This will also help us gradually switch operations from 
FSNamesystemLock to LatchLock.
# I don't know what bottleneck we will see next, but you are absolutely 
correct that there will be something. For the edits log, while running my 
benchmarks I indeed saw that the number of transactions batched together 
during journaling was increasing. This is expected and desirable behavior, 
since writing large batches to disk is more efficient than many small writes.
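
To make the cousin example in question 1 concrete, here is a minimal sketch of 
a key built from the ids of the last three path components, compared 
lexicographically. The class and field names are hypothetical illustrations, 
not the actual patch:

{code:java}
// Hypothetical sketch: a namespace key made of the ids of the last three
// path components, e.g. /a/b/c/d -> <idb, idc, idd>. Keys compare
// lexicographically, so cousins /a/b/c/d and /a/b/m/n (<idb, idc, idd> and
// <idb, idm, idn>) share the <idb> prefix and sort close together, i.e.
// they are likely to fall into the same RangeGSet. The parent key
// <ida, idb, idc> already differs in the first component, so nothing
// guarantees it ends up in the same range as its children.
public final class NamespaceKey implements Comparable<NamespaceKey> {
  private final long grandParentId; // idb for /a/b/c/d
  private final long parentId;      // idc
  private final long inodeId;       // idd

  public NamespaceKey(long grandParentId, long parentId, long inodeId) {
    this.grandParentId = grandParentId;
    this.parentId = parentId;
    this.inodeId = inodeId;
  }

  @Override
  public int compareTo(NamespaceKey other) {
    int c = Long.compare(grandParentId, other.grandParentId);
    if (c != 0) {
      return c;
    }
    c = Long.compare(parentId, other.parentId);
    if (c != 0) {
      return c;
    }
    return Long.compare(inodeId, other.inodeId);
  }
}
{code}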
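And a minimal sketch of the two-level RangeMap / RangeGSet structure and the 
latch-style locking discussed in questions 2 and 3. Again the names and the 
API are hypothetical assumptions for illustration; it assumes a NavigableMap 
keyed by the first key of each range, and the real PartitionedGSet / LatchLock 
design may differ:

{code:java}
import java.util.NavigableMap;
import java.util.TreeMap;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical sketch of the two-level structure: one RangeMap maps the
// starting key of each range to its partition, and each partition has its
// own lock.
public class PartitionedMapSketch<K extends Comparable<K>, V> {

  /** One partition (the RangeGSet analogue): a sorted map plus its own lock. */
  final class RangePartition {
    final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    final TreeMap<K, V> entries = new TreeMap<>();
  }

  // The single top-level lock; it plays the role of the latch / global lock.
  private final ReentrantReadWriteLock rangeMapLock = new ReentrantReadWriteLock();
  private final NavigableMap<K, RangePartition> rangeMap = new TreeMap<>();

  public PartitionedMapSketch(K minKey) {
    // Seed one partition covering the whole key space (keys >= minKey),
    // so floorEntry() below never returns null; range splitting is omitted.
    rangeMap.put(minKey, new RangePartition());
  }

  /** Latch-lock style update: hold the RangeMap lock only long enough to
   *  resolve the partition, then do the real work under the partition lock. */
  public void put(K key, V value) {
    RangePartition part;
    rangeMapLock.readLock().lock();
    try {
      part = rangeMap.floorEntry(key).getValue();
    } finally {
      rangeMapLock.readLock().unlock();
    }
    part.lock.writeLock().lock();
    try {
      part.entries.put(key, value);
    } finally {
      part.lock.writeLock().unlock();
    }
  }

  /** Operations like a large sub-tree delete (question 2), failover, rename,
   *  or quota change would instead take the RangeMap write lock, which acts
   *  like a global lock and blocks updates to any involved partition, e.g.
   *  the one containing file f. */
  public void globalLock()   { rangeMapLock.writeLock().lock(); }
  public void globalUnlock() { rangeMapLock.writeLock().unlock(); }
}
{code}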



> NameNode Fine-Grained Locking via Metadata Partitioning
> -------------------------------------------------------
>
>                 Key: HDFS-14703
>                 URL: https://issues.apache.org/jira/browse/HDFS-14703
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: hdfs, namenode
>            Reporter: Konstantin Shvachko
>            Priority: Major
>         Attachments: NameNode Fine-Grained Locking.pdf
>
>
> We target to enable fine-grained locking by splitting the in-memory namespace 
> into multiple partitions each having a separate lock. Intended to improve 
> performance of NameNode write operations.


