[jira] [Comment Edited] (HDFS-14703) NameNode Fine-Grained Locking via Metadata Partitioning
[ https://issues.apache.org/jira/browse/HDFS-14703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17387797#comment-17387797 ] Xing Lin edited comment on HDFS-14703 at 7/27/21, 6:07 AM:
---
[~daryn] Thanks for your comments. I will address your last question and leave the other questions to [~shv]. :)

Regarding the results, we used the standard NNThroughputBenchmark, with commands like the following.
{code:java}
./bin/hadoop org.apache.hadoop.hdfs.server.namenode.NNThroughputBenchmark -fs file:/// -op mkdirs -threads 200 -dirs 1000 -dirsPerDir 512
{code}
Here is a result from [~prasad-acit], since his QPS numbers are higher than what I got.
{code:java}
BASE: common/hadoop-hdfs-3
2021-05-17 11:17:36,973 INFO namenode.NNThroughputBenchmark: --- mkdirs inputs ---
2021-05-17 11:17:36,973 INFO namenode.NNThroughputBenchmark: nrDirs = 100
2021-05-17 11:17:36,973 INFO namenode.NNThroughputBenchmark: nrThreads = 200
2021-05-17 11:17:36,973 INFO namenode.NNThroughputBenchmark: nrDirsPerDir = 32
2021-05-17 11:17:36,973 INFO namenode.NNThroughputBenchmark: --- mkdirs stats ---
2021-05-17 11:17:36,973 INFO namenode.NNThroughputBenchmark: # operations: 100
2021-05-17 11:17:36,973 INFO namenode.NNThroughputBenchmark: Elapsed Time: 17718
2021-05-17 11:17:36,973 INFO namenode.NNThroughputBenchmark: Ops per sec: 56439.77875606727
2021-05-17 11:17:36,973 INFO namenode.NNThroughputBenchmark: Average Time: 3
2021-05-17 11:17:36,973 INFO namenode.FSEditLog: Ending log segment 1, 1031254

PATCH:
2021-05-17 11:11:09,321 INFO namenode.NNThroughputBenchmark: --- mkdirs inputs ---
2021-05-17 11:11:09,321 INFO namenode.NNThroughputBenchmark: nrDirs = 100
2021-05-17 11:11:09,321 INFO namenode.NNThroughputBenchmark: nrThreads = 200
2021-05-17 11:11:09,321 INFO namenode.NNThroughputBenchmark: nrDirsPerDir = 32
2021-05-17 11:11:09,321 INFO namenode.NNThroughputBenchmark: --- mkdirs stats ---
2021-05-17 11:11:09,321 INFO namenode.NNThroughputBenchmark: # operations: 100
2021-05-17 11:11:09,321 INFO namenode.NNThroughputBenchmark: Elapsed Time: 15010
2021-05-17 11:11:09,321 INFO namenode.NNThroughputBenchmark: Ops per sec: 66622.25183211193
2021-05-17 11:11:09,321 INFO namenode.NNThroughputBenchmark: Average Time: 2
2021-05-17 11:11:09,331 INFO namenode.FSEditLog: Ending log segment 1, 1031254
{code}
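For reference, the relative improvement implied by these two runs follows directly from the logged "Ops per sec" values. A minimal standalone sketch (the numbers are hardcoded from the logs above; the class is illustrative, not part of Hadoop):
{code:java}
/** Toy calculator: relative mkdirs throughput gain between two benchmark runs. */
public class ThroughputDelta {
  public static void main(String[] args) {
    double baseOpsPerSec = 56439.77875606727;   // "Ops per sec" from the BASE run
    double patchOpsPerSec = 66622.25183211193;  // "Ops per sec" from the PATCH run
    double gainPct = (patchOpsPerSec - baseOpsPerSec) / baseOpsPerSec * 100.0;
    // Prints roughly 18.04 for this particular pair of runs.
    System.out.printf("mkdirs improvement: %.2f%%%n", gainPct);
  }
}
{code}
Note that this single pair of runs shows a smaller gain than the multi-iteration average of around 30% reported elsewhere in this thread; individual runs are noisy.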
[jira] [Comment Edited] (HDFS-14703) NameNode Fine-Grained Locking via Metadata Partitioning
[ https://issues.apache.org/jira/browse/HDFS-14703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17345870#comment-17345870 ] Konstantin Shvachko edited comment on HDFS-14703 at 7/14/21, 5:28 PM:
--
I did some performance benchmarks using a physical server (a d430 server in the [Utah Emulab testbed|http://www.emulab.net]). I used either a RAMDISK or an SSD as the storage for HDFS; using a RAMDISK removes the time the SSD spends making each write persistent. For the RAMDISK case, we observed a 45% improvement from fine-grained locking. For the SSD case, fine-grained locking gives us about a 23% improvement. We used an Intel SSD (model: SSDSC2BX200G4R). We noticed that for trunk, the mkdirs ops/sec is lower on the RAMDISK than on the SSD. We don't know the reason for this yet; we repeated the RAMDISK experiment for trunk twice to confirm the numbers.
h2. tmpfs, hadoop.tmp.dir = /run/hadoop-utos
h3. 45% improvement, fgl vs. trunk
trunk
{noformat:nowrap}
2021-05-16 20:37:20,280 INFO namenode.NNThroughputBenchmark: # operations: 1000
2021-05-16 20:37:20,280 INFO namenode.NNThroughputBenchmark: Elapsed Time: 663510
2021-05-16 20:37:20,280 INFO namenode.NNThroughputBenchmark: Ops per sec: 15071.362
2021-05-16 20:37:20,280 INFO namenode.NNThroughputBenchmark: Average Time: 13

2021-05-16 22:15:13,515 INFO namenode.NNThroughputBenchmark: --- mkdirs stats ---
2021-05-16 22:15:13,515 INFO namenode.NNThroughputBenchmark: # operations: 1000
2021-05-16 22:15:13,515 INFO namenode.NNThroughputBenchmark: Elapsed Time: 710248
2021-05-16 22:15:13,515 INFO namenode.NNThroughputBenchmark: Ops per sec: 14079.5
2021-05-16 22:15:13,515 INFO namenode.NNThroughputBenchmark: Average Time: 14
2021-05-16 22:15:13,515 INFO namenode.FSEditLog: Ending log segment 8345565, 10019540
{noformat}
fgl
{noformat:nowrap}
2021-05-16 21:06:46,476 INFO namenode.NNThroughputBenchmark: --- mkdirs stats ---
2021-05-16 21:06:46,476 INFO namenode.NNThroughputBenchmark: # operations: 1000
2021-05-16 21:06:46,476 INFO namenode.NNThroughputBenchmark: Elapsed Time: 445980
2021-05-16 21:06:46,476 INFO namenode.NNThroughputBenchmark: Ops per sec: 22422.530
2021-05-16 21:06:46,476 INFO namenode.NNThroughputBenchmark: Average Time: 8
{noformat}
h2. SSD, hadoop.tmp.dir = /dev/sda4
h3. 23% improvement, fgl vs. trunk
trunk:
{noformat:nowrap}
2021-05-16 21:59:06,042 INFO namenode.NNThroughputBenchmark: --- mkdirs stats ---
2021-05-16 21:59:06,042 INFO namenode.NNThroughputBenchmark: # operations: 1000
2021-05-16 21:59:06,042 INFO namenode.NNThroughputBenchmark: Elapsed Time: 593839
2021-05-16 21:59:06,042 INFO namenode.NNThroughputBenchmark: Ops per sec: 16839.581
2021-05-16 21:59:06,042 INFO namenode.NNThroughputBenchmark: Average Time: 11
{noformat}
fgl
{noformat:nowrap}
2021-05-16 21:21:03,906 INFO namenode.NNThroughputBenchmark: --- mkdirs stats ---
2021-05-16 21:21:03,906 INFO namenode.NNThroughputBenchmark: # operations: 1000
2021-05-16 21:21:03,906 INFO namenode.NNThroughputBenchmark: Elapsed Time: 481269
2021-05-16 21:21:03,906 INFO namenode.NNThroughputBenchmark: Ops per sec: 20778.400
2021-05-16 21:21:03,906 INFO namenode.NNThroughputBenchmark: Average Time: 9
{noformat}
{noformat:nowrap}
/dev/sda: ATA device, with non-removable media
Model Number: INTEL SSDSC2BX200G4R
Serial Number: BTHC523202RD200TGN
Firmware Revision: G201DL2D
{noformat}
[jira] [Comment Edited] (HDFS-14703) NameNode Fine-Grained Locking via Metadata Partitioning
[ https://issues.apache.org/jira/browse/HDFS-14703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17359306#comment-17359306 ] Renukaprasad C edited comment on HDFS-14703 at 6/8/21, 12:17 PM:
-
[~shv] / [~xinglin] We have implemented FGL for the create API, done basic testing, and captured performance readings. With the create API we could see around 24% improvement (23.86% on average over three iterations). I have created PR [https://github.com/apache/hadoop/pull/3013] for the same. Can you please review and share feedback when you get time?

Command: /hadoop org.apache.hadoop.hdfs.server.namenode.NNThroughputBenchmark -fs file:/// -op create -threads 200 -files 100 -filesPerDir 40

Result:
||Iteration||Base||Patch||
|Itr-1|27124|32712|
|Itr-2|26460|31312|
|Itr-3|24166|32276|
|Avg|25916.66|32100|
|Improvement (%)| |23.86|
[jira] [Comment Edited] (HDFS-14703) NameNode Fine-Grained Locking via Metadata Partitioning
[ https://issues.apache.org/jira/browse/HDFS-14703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17346002#comment-17346002 ] Renukaprasad C edited comment on HDFS-14703 at 5/17/21, 10:56 AM:
--
Thanks [~shv] & [~xinglin]. We have tested on a 48-core physical machine and could see a significant performance improvement with the patch. On average the improvement is +*around 30%*+ with the default storage policy.
||Itr||Base||Patch||
|ITR-1|56439|66622|
|ITR-2|58092|65074|
|ITR-3|60132|74354|
|ITR-4|52056|76522|
|ITR-5|55478|65526|
|ITR-6|60664|76881|
|AVG|56066|72976|
h3. +Improvement: 30.16%+
Attached a few results:
{code:java}
BASE: common/hadoop-hdfs-3
2021-05-17 11:17:36,973 INFO namenode.NNThroughputBenchmark: --- mkdirs inputs ---
2021-05-17 11:17:36,973 INFO namenode.NNThroughputBenchmark: nrDirs = 100
2021-05-17 11:17:36,973 INFO namenode.NNThroughputBenchmark: nrThreads = 200
2021-05-17 11:17:36,973 INFO namenode.NNThroughputBenchmark: nrDirsPerDir = 32
2021-05-17 11:17:36,973 INFO namenode.NNThroughputBenchmark: --- mkdirs stats ---
2021-05-17 11:17:36,973 INFO namenode.NNThroughputBenchmark: # operations: 100
2021-05-17 11:17:36,973 INFO namenode.NNThroughputBenchmark: Elapsed Time: 17718
2021-05-17 11:17:36,973 INFO namenode.NNThroughputBenchmark: Ops per sec: 56439.77875606727
2021-05-17 11:17:36,973 INFO namenode.NNThroughputBenchmark: Average Time: 3
2021-05-17 11:17:36,973 INFO namenode.FSEditLog: Ending log segment 1, 1031254

PATCH:
2021-05-17 11:11:09,321 INFO namenode.NNThroughputBenchmark: --- mkdirs inputs ---
2021-05-17 11:11:09,321 INFO namenode.NNThroughputBenchmark: nrDirs = 100
2021-05-17 11:11:09,321 INFO namenode.NNThroughputBenchmark: nrThreads = 200
2021-05-17 11:11:09,321 INFO namenode.NNThroughputBenchmark: nrDirsPerDir = 32
2021-05-17 11:11:09,321 INFO namenode.NNThroughputBenchmark: --- mkdirs stats ---
2021-05-17 11:11:09,321 INFO namenode.NNThroughputBenchmark: # operations: 100
2021-05-17 11:11:09,321 INFO namenode.NNThroughputBenchmark: Elapsed Time: 15010
2021-05-17 11:11:09,321 INFO namenode.NNThroughputBenchmark: Ops per sec: 66622.25183211193
2021-05-17 11:11:09,321 INFO namenode.NNThroughputBenchmark: Average Time: 2
2021-05-17 11:11:09,331 INFO namenode.FSEditLog: Ending log segment 1, 1031254
{code}
Command: ./hadoop jar ../share/hadoop/common/hadoop-hdfs-3.1.1-hw-ei-SNAPSHOT-tests.jar org.apache.hadoop.hdfs.server.namenode.NNThroughputBenchmark -fs file:/// -op mkdirs -threads 200 -dirs 100 -dirsPerDir 32

Hardware configuration:
{noformat}
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
CPU(s): 48
On-line CPU(s) list: 0-47
Thread(s) per core: 2
Core(s) per socket: 12
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 63
Model name: Intel(R) Xeon(R) CPU E5-2690 v3 @ 2.60GHz
Stepping: 2
CPU MHz: 2600.406
CPU max MHz: 3500.
CPU min MHz: 1200.
BogoMIPS: 5189.51
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 30720K
NUMA node0 CPU(s): 0-11,24-35
NUMA node1 CPU(s): 12-23,36-47
{noformat}
[jira] [Comment Edited] (HDFS-14703) NameNode Fine-Grained Locking via Metadata Partitioning
[ https://issues.apache.org/jira/browse/HDFS-14703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17345069#comment-17345069 ] Xing Lin edited comment on HDFS-14703 at 5/15/21, 3:55 PM:
---
[~prasad-acit] try this command: use -fs file:/// instead of hdfs://server:port. "-fs file:///" will bypass the RPC layer and should give you higher numbers on your VM. I use the default partition size of 256.

dir: /home/xinglin/projs/hadoop/hadoop-dist/target/hadoop-3.4.0-SNAPSHOT
{noformat}
$ ./bin/hadoop org.apache.hadoop.hdfs.server.namenode.NNThroughputBenchmark -fs file:/// -op mkdirs -threads 200 -dirs 1000 -dirsPerDir 512
{noformat}
[jira] [Comment Edited] (HDFS-14703) NameNode Fine-Grained Locking via Metadata Partitioning
[ https://issues.apache.org/jira/browse/HDFS-14703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17344970#comment-17344970 ] Xing Lin edited comment on HDFS-14703 at 5/15/21, 6:06 AM:
---
[~prasad-acit] how many CPU cores does your server have? The ops per sec seem rather low compared to what I got from my Mac laptop (with 8 cores). fgl gives us about a 10% improvement running on my Mac. We will find some proper hardware to do more serious performance benchmarks.

*Trunk*
{noformat}
2021-05-11 09:52:35,666 INFO namenode.NNThroughputBenchmark:
2021-05-11 09:52:35,667 INFO namenode.NNThroughputBenchmark: --- mkdirs inputs ---
2021-05-11 09:52:35,667 INFO namenode.NNThroughputBenchmark: nrDirs = 1000
2021-05-11 09:52:35,667 INFO namenode.NNThroughputBenchmark: nrThreads = 200
2021-05-11 09:52:35,667 INFO namenode.NNThroughputBenchmark: nrDirsPerDir = 512
2021-05-11 09:52:35,667 INFO namenode.NNThroughputBenchmark: --- mkdirs stats ---
2021-05-11 09:52:35,667 INFO namenode.NNThroughputBenchmark: # operations: 1000
2021-05-11 09:52:35,667 INFO namenode.NNThroughputBenchmark: Elapsed Time: 542905
2021-05-11 09:52:35,667 INFO namenode.NNThroughputBenchmark: Ops per sec: 18419.42881351249
2021-05-11 09:52:35,667 INFO namenode.NNThroughputBenchmark: Average Time: 10
2021-05-11 09:52:35,667 INFO namenode.FSEditLog: Ending log segment 5488830, 10019538
2021-05-11 09:52:35,670 INFO namenode.FSEditLog: Number of transactions: 4530710 Total time for transactions(ms): 14288 Number of transactions batched in Syncs: 4452444 Number of syncs: 78267 SyncTimes(ms): 200575
{noformat}
*fgl*
{noformat}
2021-05-11 10:58:40,142 INFO namenode.NNThroughputBenchmark:
2021-05-11 10:58:40,142 INFO namenode.NNThroughputBenchmark: --- mkdirs inputs ---
2021-05-11 10:58:40,143 INFO namenode.NNThroughputBenchmark: nrDirs = 1000
2021-05-11 10:58:40,143 INFO namenode.NNThroughputBenchmark: nrThreads = 200
2021-05-11 10:58:40,143 INFO namenode.NNThroughputBenchmark: nrDirsPerDir = 512
2021-05-11 10:58:40,143 INFO namenode.NNThroughputBenchmark: --- mkdirs stats ---
2021-05-11 10:58:40,143 INFO namenode.NNThroughputBenchmark: # operations: 1000
2021-05-11 10:58:40,143 INFO namenode.NNThroughputBenchmark: Elapsed Time: 505892
2021-05-11 10:58:40,143 INFO namenode.NNThroughputBenchmark: Ops per sec: 19767.06490713433
2021-05-11 10:58:40,143 INFO namenode.NNThroughputBenchmark: Average Time: 10
2021-05-11 10:58:40,143 INFO namenode.FSEditLog: Ending log segment 5826307, 10019538
2021-05-11 10:58:40,146 INFO namenode.FSEditLog: Number of transactions: 4193233 Total time for transactions(ms): 13990 Number of transactions batched in Syncs: 4130972 Number of syncs: 62262 SyncTimes(ms): 168203
{noformat}
[jira] [Comment Edited] (HDFS-14703) NameNode Fine-Grained Locking via Metadata Partitioning
[ https://issues.apache.org/jira/browse/HDFS-14703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17343725#comment-17343725 ] Renukaprasad C edited comment on HDFS-14703 at 5/13/21, 6:32 AM:
-
[~shv] Thanks for sharing the patch. I tried to test the patch applied on trunk; the results are similar with and without the patch. I have attached the results for both runs below. Did I miss something?

With patch:
{code:java}
~/hadoop-3.4.0-SNAPSHOT/bin$ ./hdfs org.apache.hadoop.hdfs.server.namenode.NNThroughputBenchmark -fs hdfs://localhost:9000 -op mkdirs -threads 200 -dirs 200 -dirsPerDir 128
2021-05-13 01:57:41,279 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2021-05-13 01:57:41,976 INFO namenode.NNThroughputBenchmark: Starting benchmark: mkdirs
2021-05-13 01:57:42,065 INFO namenode.NNThroughputBenchmark: Generate 200 inputs for mkdirs
2021-05-13 01:57:43,385 INFO namenode.NNThroughputBenchmark: Log level = ERROR
2021-05-13 01:57:44,079 INFO namenode.NNThroughputBenchmark: Starting 200 mkdirs(s).
2021-05-13 02:15:59,958 INFO namenode.NNThroughputBenchmark:
2021-05-13 02:15:59,958 INFO namenode.NNThroughputBenchmark: --- mkdirs inputs ---
2021-05-13 02:15:59,958 INFO namenode.NNThroughputBenchmark: nrDirs = 200
2021-05-13 02:15:59,958 INFO namenode.NNThroughputBenchmark: nrThreads = 200
2021-05-13 02:15:59,958 INFO namenode.NNThroughputBenchmark: nrDirsPerDir = 128
2021-05-13 02:15:59,958 INFO namenode.NNThroughputBenchmark: --- mkdirs stats ---
2021-05-13 02:15:59,958 INFO namenode.NNThroughputBenchmark: # operations: 200
2021-05-13 02:15:59,958 INFO namenode.NNThroughputBenchmark: Elapsed Time: 1095122
2021-05-13 02:15:59,958 INFO namenode.NNThroughputBenchmark: Ops per sec: 1826.2805422592187
2021-05-13 02:15:59,959 INFO namenode.NNThroughputBenchmark: Average Time: 108
{code}
Without patch:
{code:java}
/hadoop-3.4.0-SNAPSHOT/bin$ ./hdfs org.apache.hadoop.hdfs.server.namenode.NNThroughputBenchmark -fs hdfs://localhost:9000 -op mkdirs -threads 200 -dirs 200 -dirsPerDir 128
2021-05-13 03:25:53,243 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2021-05-13 03:25:54,046 INFO namenode.NNThroughputBenchmark: Starting benchmark: mkdirs
2021-05-13 03:25:54,117 INFO namenode.NNThroughputBenchmark: Generate 200 inputs for mkdirs
2021-05-13 03:25:55,076 INFO namenode.NNThroughputBenchmark: Log level = ERROR
2021-05-13 03:25:55,163 INFO namenode.NNThroughputBenchmark: Starting 200 mkdirs(s).
2021-05-13 03:43:40,125 INFO namenode.NNThroughputBenchmark:
2021-05-13 03:43:40,125 INFO namenode.NNThroughputBenchmark: --- mkdirs inputs ---
2021-05-13 03:43:40,125 INFO namenode.NNThroughputBenchmark: nrDirs = 200
2021-05-13 03:43:40,125 INFO namenode.NNThroughputBenchmark: nrThreads = 200
2021-05-13 03:43:40,125 INFO namenode.NNThroughputBenchmark: nrDirsPerDir = 128
2021-05-13 03:43:40,125 INFO namenode.NNThroughputBenchmark: --- mkdirs stats ---
2021-05-13 03:43:40,125 INFO namenode.NNThroughputBenchmark: # operations: 200
2021-05-13 03:43:40,125 INFO namenode.NNThroughputBenchmark: Elapsed Time: 1064420
2021-05-13 03:43:40,125 INFO namenode.NNThroughputBenchmark: Ops per sec: 1878.9575543488472
2021-05-13 03:43:40,125 INFO namenode.NNThroughputBenchmark: Average Time: 105
{code}
Similar results were achieved when I tried with "file:///" as well, but in this case the partitions were empty.
{code:java}
./hdfs org.apache.hadoop.hdfs.server.namenode.NNThroughputBenchmark -fs file:/// -op mkdirs -threads 200 -dirs 200 -dirsPerDir 128
...
2021-05-13 09:20:36,921 INFO namenode.NNThroughputBenchmark:
2021-05-13 09:20:36,921 INFO namenode.NNThroughputBenchmark: --- mkdirs inputs ---
2021-05-13 09:20:36,921 INFO namenode.NNThroughputBenchmark: nrDirs = 200
2021-05-13 09:20:36,921 INFO namenode.NNThroughputBenchmark: nrThreads = 200
2021-05-13 09:20:36,921 INFO namenode.NNThroughputBenchmark: nrDirsPerDir = 128
2021-05-13 09:20:36,921 INFO namenode.NNThroughputBenchmark: --- mkdirs stats ---
2021-05-13 09:20:36,921 INFO namenode.NNThroughputBenchmark: # operations: 200
2021-05-13 09:20:36,921 INFO namenode.NNThroughputBenchmark: Elapsed Time: 845625
2021-05-13 09:20:36,921 INFO namenode.NNThroughputBenchmark: Ops per sec: 2365.1145602365114
2021-05-13 09:20:36,921 INFO namenode.NNThroughputBenchmark: Average Time: 84
2021-05-13 09:20:36,922 INFO namenode.FSEditLog: Ending log segment 1465676, 2015633
2021-05-13 09:20:36,987 INFO namenode.FSEditLog: Number of transactions: 549959 Total time for transactions(ms): 2840 Number of transactions batched in Syncs: 545346 Number of syncs: 4614 SyncTimes(ms): 240432
2021-05-13 09:20:36,996 INFO namenode.FileJournalManager: Finalizing edits file
{code}
[jira] [Comment Edited] (HDFS-14703) NameNode Fine-Grained Locking via Metadata Partitioning
[ https://issues.apache.org/jira/browse/HDFS-14703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17341089#comment-17341089 ] Konstantin Shvachko edited comment on HDFS-14703 at 5/10/21, 7:30 PM:
--
Updated the POC patches to current trunk. There were indeed some missing parts in the first patch. See [^003-partitioned-inodeMap-POC.tar.gz]. Also created a remote branch called {{fgl}} in the hadoop repo with both patches applied to current trunk. [~xinglin] is working on adding the {{create()}} call to FGL; right now only {{mkdirs()}} is supported.
[jira] [Comment Edited] (HDFS-14703) NameNode Fine-Grained Locking via Metadata Partitioning
[ https://issues.apache.org/jira/browse/HDFS-14703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17187409#comment-17187409 ] junbiao chen edited comment on HDFS-14703 at 8/31/20, 6:19 AM:
---
[~hexiaoqiao] I want to do some work on this issue. Could you tell me which version the patch is based on? Thanks.
[jira] [Comment Edited] (HDFS-14703) NameNode Fine-Grained Locking via Metadata Partitioning
[ https://issues.apache.org/jira/browse/HDFS-14703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16919948#comment-16919948 ] Arpit Agarwal edited comment on HDFS-14703 at 8/30/19 11:04 PM:

Interesting proposal, [~shv]. Thanks for sharing this and the PoC patch. I went through the doc and the idea seems interesting, but I didn't understand how the partitioning scheme works. Do atomic rename and snapshots still work as before with these changes? Did you measure the write throughput improvement with {{dfs.namenode.edits.asynclogging}}?
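For reference, that setting is a NameNode property in hdfs-site.xml; a minimal fragment for enabling it is below (its availability and default value depend on the Hadoop version):
{code:xml}
<!-- hdfs-site.xml: let the NameNode log edits asynchronously -->
<property>
  <name>dfs.namenode.edits.asynclogging</name>
  <value>true</value>
</property>
{code}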
[jira] [Comment Edited] (HDFS-14703) NameNode Fine-Grained Locking via Metadata Partitioning
[ https://issues.apache.org/jira/browse/HDFS-14703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16909681#comment-16909681 ] He Xiaoqiao edited comment on HDFS-14703 at 8/17/19 12:41 PM:
--
Thanks [~shv] for your POC patches. I have to say this is a very clever design for fine-grained locking. There are still a couple of questions that I do not quite understand, and I look forward to your response.

1. Write concurrency control. Consider a case with two threads running mkdir(/a/b/c/d/e) and delete(/a/b/c). I tried to run this case following the design and POC patches, but I usually get unstable results, since the keys involved can be located in different RangeGSets via {{INodeMap#latchWriteLock}}; the two threads can then run concurrently and produce unstable results even when issued from one client, one after another. As your earlier explanation says, "deleting a directory should lock all RangeGSets involved". Is delete a special case here? Sorry for asking this question again.
{quote} Deleting a directory /a/b/c means deleting the entire sub-tree underneath this directory. We should lock all RangeGSets involved in such deletion, particularly the one containing file f. So f cannot be modified concurrently with the delete. {quote}
2. {{INode}} gains a {{long[] namespaceKey}} field in patch 0004 of the POC package. I believe this attribute is very useful for partitioning INodes; meanwhile, does it bring some other potential issues?
* Heap footprint overhead. For a long-running NameNode process, the namespaceKey of most INodes in the directory tree (those visited at least once) will be non-null. If we consider 500M INodes with {{level}} = 2, this needs more than 8 GB of heap (500M inodes x 2 longs x 8 bytes = 8 GB for the key payloads alone, before array object headers and references).
* When an INode is renamed, the {{namespaceKey}} has to be updated, right? Since its parent INode has changed. The POC seems not to update it once {{namespaceKey}} is non-null. Is it possible to compute the namespaceKey for an INode when it is used, outside of the lock (as sketched below)? Of course, that would add CPU overhead. Please correct me if I am wrong. Thanks.

3. There is no LatchLock unlock in the POC for the #mkdir operation, which looks like a bit of an oversight. In my opinion, it has to release the childLock after use, right?

[~shv] Thanks for your POC patches again, and looking forward to the next milestone. I would like to get involved in pushing this feature forward if needed.
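To make the proposal in question 2 concrete, here is a rough sketch of computing the key on demand by walking up the parent chain, instead of caching a {{long[] namespaceKey}} on every INode. All names here are hypothetical (this is not the POC's actual code), and it assumes the fixed-length key of the design doc:
{code:java}
/** Hypothetical sketch: derive the partition key by walking up the parent chain. */
public final class NamespaceKeys {

  /** Minimal stand-in for an INode: an id plus a parent pointer. */
  public interface Node {
    long getId();
    Node getParent(); // null at the root
  }

  /**
   * Builds key(f) = <ppId, pId, selfId> for levels = 2, padding with 0
   * above the root. Computed per call, so nothing is cached, nothing goes
   * stale on rename, and no per-inode heap is spent; the cost is a short
   * pointer walk each time the key is needed.
   */
  public static long[] keyOf(Node node, int levels) {
    long[] key = new long[levels + 1];
    Node cur = node;
    for (int i = levels; i >= 0 && cur != null; i--) {
      key[i] = cur.getId();   // fill from selfId backwards to the ancestors
      cur = cur.getParent();
    }
    return key;
  }
}
{code}
Whether this trade is acceptable depends on how hot the key-lookup path is, which is exactly the CPU-versus-heap question raised above.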
[jira] [Comment Edited] (HDFS-14703) NameNode Fine-Grained Locking via Metadata Partitioning
[ https://issues.apache.org/jira/browse/HDFS-14703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16903374#comment-16903374 ] Konstantin Shvachko edited comment on HDFS-14703 at 8/8/19 10:10 PM:
-
Hi [~hexiaoqiao], thanks for reviewing the doc. Very good questions:
# "Cousins" means files like {{/a/b/c/d}} and {{/a/b/m/n}}. They will have keys, respectively, {{<idb, idc, idd>}} and {{<idb, idm, idn>}}, which have the common prefix {{<idb>}} and therefore are likely to fall into the same RangeGSet. In your example {{idc}} is the parent of {{idd}}, and this key definition does not guarantee them to be in the same range.
# Deleting a directory {{/a/b/c}} means deleting the entire sub-tree underneath this directory. We should lock all RangeGSets involved in such a deletion, particularly the one containing file {{f}}. So {{f}} cannot be modified concurrently with the delete.
# Just to clarify, RangeMap is the upper-level part of PartitionedGSet, which maps key ranges to RangeGSets; so there is only one RangeMap and many RangeGSets (a toy sketch of this mapping follows below). Holding a lock on the RangeMap is akin to holding a global lock. You make a good point that some operations like failover, large deletes, renames, and quota changes will still require a global lock. The lock on the RangeMap could play the role of such a global lock. This should be defined in more detail within the design of LatchLock. Ideally we should retain FSNamesystemLock as a global lock for some operations; this will also help us gradually switch operations from FSNamesystemLock to LatchLock.
# I don't know what the next bottleneck will be, but you are absolutely correct that there will be something. For the edit log, I indeed saw while running my benchmarks that the number of transactions batched together while journaling was increasing. This is expected and desirable behavior, since writing large batches to disk is more efficient than lots of small writes.
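The toy sketch referenced in answer 3: a minimal model of the upper-level lookup, assuming keys are compared lexicographically. The class and method names are illustrative, not the ones in the POC patches:
{code:java}
import java.util.Arrays;
import java.util.TreeMap;

/** Toy model of PartitionedGSet's upper level: one RangeMap over many RangeGSets. */
public class ToyRangeMap<V> {
  /** Maps the smallest key of each range to that range's partition. */
  private final TreeMap<long[], V> ranges =
      new TreeMap<>(Arrays::compare); // lexicographic comparison of long[] keys

  public void addRange(long[] startKey, V partition) {
    ranges.put(startKey, partition);
  }

  /** Locate the partition whose range covers the given key. */
  public V partitionOf(long[] key) {
    // floorEntry: the range with the greatest start key <= this key.
    return ranges.floorEntry(key).getValue();
  }

  public static void main(String[] args) {
    ToyRangeMap<String> map = new ToyRangeMap<>();
    map.addRange(new long[]{0, 0, 0}, "rangeGSet-0");
    map.addRange(new long[]{100, 0, 0}, "rangeGSet-1");
    // Keys sharing the leading prefix (here 100) land in the same partition,
    // so siblings and cousins are usually served under one range lock.
    System.out.println(map.partitionOf(new long[]{100, 7, 42}));  // rangeGSet-1
    System.out.println(map.partitionOf(new long[]{100, 8, 99}));  // rangeGSet-1
  }
}
{code}
In this model, "holding a lock on the RangeMap" would serialize every {{partitionOf}} lookup, which is why it behaves like a global lock.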
[jira] [Comment Edited] (HDFS-14703) NameNode Fine-Grained Locking via Metadata Partitioning
[ https://issues.apache.org/jira/browse/HDFS-14703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16901887#comment-16901887 ] He Xiaoqiao edited comment on HDFS-14703 at 8/7/19 1:04 PM:

Thanks [~shv] for filing this JIRA and planning to push this feature forward; it is great work. Really appreciate it. There are some details I am confused about after reading the design document. As the design document says, each inode maps (through its inode key) to one range, which has a separate lock, so operations on different ranges can proceed concurrently.
{quote}The inode key is a fixed-length sequence of parent inode ids ending with the file inode id itself: key(f) = <ppId, pId, selfId>, where selfId is the inodeId of file f, pId is the id of its parent, and ppId is the id of the parent of the parent. Such a definition of a key guarantees that not only siblings but also cousins (objects having the same grandparent) are partitioned into the same range most of the time.{quote}
Consider the following path: /a/b/c/d; the corresponding inode ids are [ida, idb, idc, idd].
1. How could we guarantee to map 'cousins' into the same range? At first glance, they could map to different ranges, since for idc its inode key = <ida, idb, idc> and for idd its inode key = <idb, idc, idd>. Furthermore, if we rename one inode from one range to another, do we need to transfer all of its children and sub-tree inodes to the other range?
2. Any consideration of operating on a node and its ancestor concurrently? For instance, with /a/b/c/d/e/f we could delete inode c and modify inode f at the same time if they map to different ranges, since we do not guarantee mapping them to the same one; that may be a problem in this case.
3. Which lock will be held for global requests like HA failover, safemode, etc.? Do we need to obtain all the range locks?
4. Any bottleneck after improving write throughput? I believe the edit log ops will keep increasing; will it become the new bottleneck?
Please correct me if I do not understand correctly. Thanks.