[ https://issues.apache.org/jira/browse/HDFS-14617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17331414#comment-17331414 ]
Qi Zhu commented on HDFS-14617: ------------------------------- cc [~sodonnell] [~hexiaoqiao] Could you help backport to 3.2.2 and 3.2.1 ? Our production clusters need to use this in 3.2.2. Thanks. > Improve fsimage load time by writing sub-sections to the fsimage index > ---------------------------------------------------------------------- > > Key: HDFS-14617 > URL: https://issues.apache.org/jira/browse/HDFS-14617 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namenode > Reporter: Stephen O'Donnell > Assignee: Stephen O'Donnell > Priority: Major > Fix For: 2.10.0, 3.3.0 > > Attachments: HDFS-14617.001.patch, ParallelLoading.svg, > SerialLoading.svg, dirs-single.svg, flamegraph.parallel.svg, > flamegraph.serial.svg, inodes.svg > > > Loading an fsimage is basically a single threaded process. The current > fsimage is written out in sections, eg iNode, iNode_Directory, Snapshots, > Snapshot_Diff etc. Then at the end of the file, an index is written that > contains the offset and length of each section. The image loader code uses > this index to initialize an input stream to read and process each section. It > is important that one section is fully loaded before another is started, as > the next section depends on the results of the previous one. > What I would like to propose is the following: > 1. When writing the image, we can optionally output sub_sections to the > index. That way, a given section would effectively be split into several > sections, eg: > {code:java} > inode_section offset 10 length 1000 > inode_sub_section offset 10 length 500 > inode_sub_section offset 510 length 500 > > inode_dir_section offset 1010 length 1000 > inode_dir_sub_section offset 1010 length 500 > inode_dir_sub_section offset 1010 length 500 > {code} > Here you can see we still have the original section index, but then we also > have sub-section entries that cover the entire section. Then a processor can > either read the full section in serial, or read each sub-section in parallel. > 2. In the Image Writer code, we should set a target number of sub-sections, > and then based on the total inodes in memory, it will create that many > sub-sections per major image section. I think the only sections worth doing > this for are inode, inode_reference, inode_dir and snapshot_diff. All others > tend to be fairly small in practice. > 3. If there are under some threshold of inodes (eg 10M) then don't bother > with the sub-sections as a serial load only takes a few seconds at that scale. > 4. The image loading code can then have a switch to enable 'parallel loading' > and a 'number of threads' where it uses the sub-sections, or if not enabled > falls back to the existing logic to read the entire section in serial. > Working with a large image of 316M inodes and 35GB on disk, I have a proof of > concept of this change working, allowing just inode and inode_dir to be > loaded in parallel, but I believe inode_reference and snapshot_diff can be > make parallel with the same technique. > Some benchmarks I have are as follows: > {code:java} > Threads 1 2 3 4 > -------------------------------- > inodes 448 290 226 189 > inode_dir 326 211 170 161 > Total 927 651 535 488 (MD5 calculation about 100 seconds) > {code} > The above table shows the time in seconds to load the inode section and the > inode_directory section, and then the total load time of the image. > With 4 threads using the above technique, we are able to better than half the > load time of the two sections. With the patch in HDFS-13694 it would take a > further 100 seconds off the run time, going from 927 seconds to 388, which is > a significant improvement. Adding more threads beyond 4 has diminishing > returns as there are some synchronized points in the loading code to protect > the in memory structures. -- This message was sent by Atlassian Jira (v8.3.4#803005) --------------------------------------------------------------------- To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org