[ https://issues.apache.org/jira/browse/HDFS-15987?focusedWorklogId=584637&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-584637 ]
ASF GitHub Bot logged work on HDFS-15987:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 17/Apr/21 15:03
            Start Date: 17/Apr/21 15:03
    Worklog Time Spent: 10m
      Work Description: whbing commented on a change in pull request #2918:
URL: https://github.com/apache/hadoop/pull/2918#discussion_r615265357

##########
File path: hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/tools/offlineImageViewer/PBImageTextWriter.java
##########
@@ -649,14 +679,123 @@ private void output(Configuration conf, FileSummary summary,
         is = FSImageUtil.wrapInputStreamForCompression(conf,
             summary.getCodec(), new BufferedInputStream(new LimitInputStream(
             fin, section.getLength())));
-        outputINodes(is);
+        INodeSection s = INodeSection.parseDelimitedFrom(is);
+        LOG.info("Found {} INodes in the INode section", s.getNumInodes());
+        int count = outputINodes(is, out);
+        LOG.info("Outputted {} INodes.", count);
       }
     }
     afterOutput();
     long timeTaken = Time.monotonicNow() - startTime;
     LOG.debug("Time to output inodes: {}ms", timeTaken);
   }

+  /**
+   * STEP1: Multi-threaded process sub-sections
+   * Given n (1<n<=k) threads to process k sections,
+   * E.g. 10 sections and 4 threads, grouped as follows:
+   * |---------------------------------------------------------------|
+   * | (0 1 2)        (3 4 5)         (6 7)           (8 9)          |
+   * | thread[0]      thread[1]       thread[2]       thread[3]      |
+   * |---------------------------------------------------------------|
+   *
+   * STEP2: Merge files.
+   */
+  private void outputInParallel(Configuration conf, FileSummary summary,

Review comment:
Thanks @Hexiaoqiao for the guidance. Other sub-sections could also be optimized, but I don't think they should be the focus of this optimization. My analysis is below.
There are several steps to parse an fsimage in the DELIMITED format:
- 1) Load string table
- 2) Load inode references
- 3) Handle INODE to memory or levelDB
- 4) Handle INODE_DIR to memory or levelDB
- 5) Output INODE

For example, in our practice it takes 7 hours to parse a large fsimage file, and the 5th step alone, which only uses INODE, takes more than 6 hours. So I parallelized the 5th step.
The 3rd and 4th steps are basically memory operations, which are not very time-consuming. It may be possible to use the INODE_SUB or INODE_DIR_SUB feature to process them in parallel, but I am not sure whether that is necessary.
I hope we can discuss further to clarify whether other sub-sections need to be processed. Thanks!

Issue Time Tracking
-------------------

    Worklog Id:     (was: 584637)
    Time Spent: 1h  (was: 50m)

> Improve oiv tool to parse fsimage file in parallel with delimited format
> ------------------------------------------------------------------------
>
>                 Key: HDFS-15987
>                 URL: https://issues.apache.org/jira/browse/HDFS-15987
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>            Reporter: Hongbing Wang
>            Assignee: Hongbing Wang
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 1h
>  Remaining Estimate: 0h
>
> The purpose of this Jira is to improve the oiv tool so that it parses an
> fsimage file with sub-sections (see HDFS-14617) in parallel when using the
> delimited format.
> 1. Serial parsing is time-consuming
> The time to serially parse a large fsimage with delimited format (e.g. `hdfs
> oiv -p Delimited -t <tmp> ...`) is as follows:
> {code:java}
> 1) Loading string table: -> Not time consuming.
> 2) Loading inode references: -> Not time consuming
> 3) Loading directories in INode section: -> Slightly time consuming (3%)
> 4) Loading INode directory section: -> A bit time consuming (11%)
> 5) Output: -> Very time consuming (86%){code}
> Therefore, output is the stage most worth parallelizing.
> 2. How to output in parallel
> The sub-sections are grouped in order. Each thread processes one group and
> writes it to its own output file, and the output files are merged at the end.
> 3. The result of a test
> {code:java}
> input fsimage file info:
> 3.4G, 12 sub-sections, 55976500 INodes
> -----------------------------------------
> Threads TotalTime OutputTime MergeTime
> 1       18m37s    16m18s     -
> 4       8m7s      4m49s      41s{code}
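
The grouping and merge scheme described above (sub-sections grouped in order, one output file per thread, files merged at the end) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code from pull request #2918: the class name `ParallelOutputSketch`, the methods `group` and `outputInParallel`, the temp-file prefix, and the placeholder "section N" lines are all hypothetical, and a real implementation would parse and print the INodes of each sub-section instead.

```java
import java.io.BufferedWriter;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical sketch; names and structure are assumptions, not Hadoop code. */
public class ParallelOutputSketch {

  /**
   * Splits section indices 0..k-1 into n contiguous groups whose sizes differ
   * by at most one; the first (k % n) groups receive the extra section.
   */
  public static List<List<Integer>> group(int k, int n) {
    if (n < 1 || n > k) {
      throw new IllegalArgumentException("need 1 <= n <= k");
    }
    List<List<Integer>> groups = new ArrayList<>();
    int base = k / n;
    int extra = k % n;
    int next = 0;
    for (int t = 0; t < n; t++) {
      int size = base + (t < extra ? 1 : 0);
      List<Integer> g = new ArrayList<>();
      for (int i = 0; i < size; i++) {
        g.add(next++);
      }
      groups.add(g);
    }
    return groups;
  }

  /** STEP1: each thread writes its group to its own temp file. STEP2: merge in order. */
  public static void outputInParallel(int k, int n, Path out) throws Exception {
    List<List<Integer>> groups = group(k, n);
    ExecutorService pool = Executors.newFixedThreadPool(n);
    List<Path> parts = new ArrayList<>();
    List<Future<?>> pending = new ArrayList<>();
    try {
      for (List<Integer> g : groups) {
        Path part = Files.createTempFile("oiv-part", ".txt"); // assumed naming
        parts.add(part);
        pending.add(pool.submit(() -> {
          try (BufferedWriter w =
              Files.newBufferedWriter(part, StandardCharsets.UTF_8)) {
            for (int section : g) {
              // Placeholder for "output the INodes of this sub-section".
              w.write("section " + section);
              w.newLine();
            }
          }
          return null;
        }));
      }
      for (Future<?> f : pending) {
        f.get(); // wait for the task and propagate any failure
      }
    } finally {
      pool.shutdown();
    }
    // Concatenate the per-thread files in group order so the merged output
    // has the same ordering as a serial run.
    try (OutputStream merged = Files.newOutputStream(out)) {
      for (Path part : parts) {
        Files.copy(part, merged);
        Files.delete(part);
      }
    }
  }

  public static void main(String[] args) throws Exception {
    // Prints [[0, 1, 2], [3, 4, 5], [6, 7], [8, 9]]
    System.out.println(group(10, 4));
    Path out = Files.createTempFile("oiv-merged", ".txt");
    outputInParallel(10, 4, out);
    System.out.println("merged lines: " + Files.readAllLines(out).size());
  }
}
```

Keeping each group contiguous means the merge step is a plain concatenation in thread order. The merge is still serial work, which is consistent with the separate MergeTime column in the test result above.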