whbing commented on a change in pull request #2918:
URL: https://github.com/apache/hadoop/pull/2918#discussion_r615265357



##########
File path: 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/tools/offlineImageViewer/PBImageTextWriter.java
##########
@@ -649,14 +679,123 @@ private void output(Configuration conf, FileSummary summary,
         is = FSImageUtil.wrapInputStreamForCompression(conf,
             summary.getCodec(), new BufferedInputStream(new LimitInputStream(
                 fin, section.getLength())));
-        outputINodes(is);
+        INodeSection s = INodeSection.parseDelimitedFrom(is);
+        LOG.info("Found {} INodes in the INode section", s.getNumInodes());
+        int count = outputINodes(is, out);
+        LOG.info("Outputted {} INodes.", count);
       }
     }
     afterOutput();
     long timeTaken = Time.monotonicNow() - startTime;
     LOG.debug("Time to output inodes: {}ms", timeTaken);
   }
 
+  /**
+   * STEP1: Multi-threaded process sub-sections
+   * Given n (1<n<=k) threads to process k sections,
+   * E.g. 10 sections and 4 threads, grouped as follows:
+   * |---------------------------------------------------------------|
+   * | (0    1    2)    (3    4    5)    (6    7)     (8    9)       |
+   * | thread[0]        thread[1]        thread[2]    thread[3]      |
+   * |---------------------------------------------------------------|
+   *
+   * STEP2: Merge files.
+   */
+  private void outputInParallel(Configuration conf, FileSummary summary,

Review comment:
       Thanks @Hexiaoqiao for the guidance. The other sub-sections could also be optimized, but I think they are not the main bottleneck. My analysis is below.
   
   There are several steps to parse an fsimage in DELIMITED format:
   - 1) Load the string table
   - 2) Load the inode references
   - 3) Load INODE into memory or LevelDB
   - 4) Load INODE_DIR into memory or LevelDB
   - 5) Output INODE

   In our practice, for example, it takes about 7 hours to parse a large fsimage file, and the 5th step alone, which only reads the INODE section, takes more than 6 hours. So I parallelized the 5th step.
   
   The 3rd and 4th steps are mostly in-memory operations and are not very time-consuming. It may be possible to use the INODE_SUB or INODE_DIR_SUB sub-sections to parallelize them as well, but I am not sure it is necessary.
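
   As a side note (my understanding, not something verified in this PR): the INODE_SUB and INODE_DIR_SUB sub-sections are only written into the fsimage when the image is saved with parallel save/load enabled, so any sub-section based parallelism in the oiv tool would only help for images produced with settings along these lines. The key names below are assumptions based on the parallel image loading feature and should be checked against hdfs-default.xml.

```java
import org.apache.hadoop.conf.Configuration;

/** Illustrative only; key names assumed from the parallel fsimage feature. */
public class ParallelImageConfSketch {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Assumed switch that enables writing INODE_SUB / INODE_DIR_SUB
    // sub-sections when the image is saved.
    conf.setBoolean("dfs.image.parallel.load", true);
    // Assumed key for the rough number of sub-sections per section.
    conf.setInt("dfs.image.parallel.target.sections", 12);
    System.out.println(conf.getBoolean("dfs.image.parallel.load", false));
  }
}
```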
   
   I hope we can discuss further whether the other sub-sections need to be processed in parallel. Thanks!



