[ 
https://issues.apache.org/jira/browse/HDFS-16019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangyi Zhu resolved HDFS-16019.
--------------------------------
    Resolution: Later

> HDFS: Inode CheckPoint 
> -----------------------
>
>                 Key: HDFS-16019
>                 URL: https://issues.apache.org/jira/browse/HDFS-16019
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: namenode
>    Affects Versions: 3.3.1
>            Reporter: Xiangyi Zhu
>            Assignee: Xiangyi Zhu
>            Priority: Major
>
> *background*
> The OIV IMAGE analysis tool has brought us many benefits, such as file size 
> distribution analysis, hot/cold data analysis, and abnormal directory growth 
> analysis. But in my opinion it is too slow, especially for a large IMAGE.
> After Hadoop 2.3, the IMAGE format changed. The OIV tool must load the 
> entire IMAGE into memory before it can output the inode information in text 
> format. For a large IMAGE, this process takes a long time, consumes 
> considerable resources, and requires a machine with a large amount of memory 
> for the analysis.
> HDFS does provide the dfs.namenode.legacy-oiv-image.dir parameter to 
> produce an old-format IMAGE at checkpoint time. Parsing the old IMAGE does 
> not require many resources, but we still need to parse that IMAGE again with 
> the hdfs oiv_legacy command to get the inode information as text, which is 
> relatively time-consuming.
> *Solution*
> We can have the standby node periodically checkpoint the inodes and 
> serialize them in text form. For the output, different FileSystems can be 
> used according to the configuration, such as the local file system or HDFS. 
> The advantage of writing to HDFS is that we can analyze the inodes directly 
> with Spark/Hive. I think the block information corresponding to an inode is 
> not of much use; the file size and the replication factor are more useful 
> to us.
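The per-inode text record the proposal describes might look like the sketch below: keep only the path, file size, and replication factor, and drop block-level detail. The class name, field choice, and tab delimiter here are my own assumptions for illustration, not an actual HDFS format.

```java
// Illustrative sketch only: a possible text-mode inode record with the
// fields the proposal calls useful (path, size, replication). Not an
// actual HDFS class or format.
public class InodeTextRecord {

    // Tab-separated so the output can be loaded directly into Spark/Hive
    // with a simple three-column schema.
    static String toRecord(String path, long size, short replication) {
        return path + "\t" + size + "\t" + replication;
    }

    public static void main(String[] args) {
        System.out.println(toRecord("/user/data/part-0", 134217728L, (short) 3));
        // prints: /user/data/part-0	134217728	3
    }
}
```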
> In addition, sequential output of the inodes is not necessary. We can speed 
> up the inode checkpoint by partitioning the serialized inodes across 
> different output files: a producer thread puts inodes into a queue, and 
> multiple consumer threads drain the queue and write to different partition 
> files. The output files can also be compressed to reduce disk IO.
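The producer/consumer partitioning idea above can be sketched with standard Java concurrency primitives. This is a minimal illustration, not the proposed patch: it hash-partitions tab-separated records across per-partition queues, with one gzip-compressed writer per partition. The class and file names are invented for the example.

```java
import java.io.*;
import java.nio.file.*;
import java.util.*;
import java.util.concurrent.*;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Sketch of a partitioned, compressed checkpoint writer: a producer routes
// each record to a partition queue by hash; one consumer thread per
// partition drains its queue into a gzip file. Illustrative only.
public class PartitionedInodeWriter {

    // Sentinel record telling a consumer thread to finish and close its file.
    private static final String POISON = "\u0000";

    static List<Path> writePartitioned(List<String> records, int partitions,
                                       Path outDir) throws Exception {
        List<BlockingQueue<String>> queues = new ArrayList<>();
        List<Path> files = new ArrayList<>();
        ExecutorService pool = Executors.newFixedThreadPool(partitions);
        List<Future<?>> writers = new ArrayList<>();

        for (int i = 0; i < partitions; i++) {
            BlockingQueue<String> q = new LinkedBlockingQueue<>();
            queues.add(q);
            Path file = outDir.resolve("inodes-part-" + i + ".txt.gz");
            files.add(file);
            // Consumer: write records from this partition's queue, compressed.
            writers.add(pool.submit(() -> {
                try (Writer w = new OutputStreamWriter(
                        new GZIPOutputStream(Files.newOutputStream(file)))) {
                    String rec;
                    while (!(rec = q.take()).equals(POISON)) {
                        w.write(rec);
                        w.write('\n');
                    }
                }
                return null;
            }));
        }

        // Producer: route each record to a partition by hash of the record.
        for (String rec : records) {
            queues.get(Math.floorMod(rec.hashCode(), partitions)).put(rec);
        }
        for (BlockingQueue<String> q : queues) q.put(POISON);
        for (Future<?> f : writers) f.get();   // surface any writer errors
        pool.shutdown();
        return files;
    }

    public static void main(String[] args) throws Exception {
        Path dir = Files.createTempDirectory("inode-ckpt");
        List<Path> files = writePartitioned(
            Arrays.asList("/user/a\t1024\t3", "/user/b\t2048\t2", "/tmp/c\t10\t1"),
            2, dir);
        long total = 0;
        for (Path f : files) {
            try (BufferedReader r = new BufferedReader(new InputStreamReader(
                    new GZIPInputStream(Files.newInputStream(f))))) {
                while (r.readLine() != null) total++;
            }
        }
        System.out.println(total);  // every record lands in exactly one partition
    }
}
```

In a real implementation the producer would be the standby's inode-tree walk and the record key would be the inode path, but the queue-per-partition shape is the same.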



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
