[ https://issues.apache.org/jira/browse/HDFS-5698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13886041#comment-13886041 ]
Haohui Mai commented on HDFS-5698:
----------------------------------

I prototyped a parallelized version of the FSImageLoader. Here are the numbers from a run in the same environment:

|Size in Old|512M|1G|2G|4G|8G|
|Loading in Old (ms)|12819|24664|48240|114090|307689|
|Loading in PB(Parallel) (ms)|17927|32997|64581|138306|373391|

The main changes of the prototype are:

# Compute the MD5 checksum in a separate thread.
# Parallelize the construction of INodes (with 6 threads).
# Use a dedicated thread to update the blocks map.

I haven't parallelized the construction of INodeDirectory yet, but the performance numbers seem reasonable to me (recall that leaving safe mode automatically takes 30 seconds). I think we can fine-tune the performance after the work is merged into trunk.

> Use protobuf to serialize / deserialize FSImage
> -----------------------------------------------
>
>                 Key: HDFS-5698
>                 URL: https://issues.apache.org/jira/browse/HDFS-5698
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>            Reporter: Haohui Mai
>            Assignee: Haohui Mai
>         Attachments: HDFS-5698.000.patch, HDFS-5698.001.patch
>
>
> Currently, the code serializes FSImage using in-house serialization
> mechanisms. There are a couple of disadvantages of the current approach:
> # Mixing the responsibility of reconstruction and serialization /
> deserialization. The current code paths of serialization / deserialization
> have spent a lot of effort on maintaining compatibility. What is worse is
> that they are mixed with the complex logic of reconstructing the namespace,
> making the code difficult to follow.
> # Poor documentation of the current FSImage format. The format of the FSImage
> is practically defined by the implementation. A bug in the implementation
> means a bug in the specification. Furthermore, it also makes writing
> third-party tools quite difficult.
> # Changing schemas is non-trivial. Adding a field in FSImage requires bumping
> the layout version every time. Bumping the layout version requires (1) the
> users to explicitly upgrade the clusters, and (2) putting in new code to
> maintain backward compatibility.
> This jira proposes to use protobuf to serialize the FSImage. Protobuf has
> been used to serialize / deserialize the RPC messages in Hadoop.
> Protobuf addresses all the above problems. It clearly separates the
> responsibility of serialization and reconstructing the namespace. The
> protobuf files document the current format of the FSImage. The developers now
> can add optional fields with ease, since the old code can always read the new
> FSImage.

--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
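
For readers who want a concrete picture of the three changes listed in the comment (MD5 in a separate thread, INode construction on 6 threads, a dedicated thread for the blocks map), here is a minimal sketch of that threading layout. It is not the prototype's code: the class `ParallelLoadSketch`, the methods `load` and `buildRecord`, and the use of byte-array chunks with a chunk-index to chunk-length map as a stand-in for the blocks map are all hypothetical names and simplifications chosen for illustration.

```java
import java.math.BigInteger;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Hypothetical sketch; none of these names come from the HDFS code base. */
final class ParallelLoadSketch {

    /** Hex MD5 of all chunks plus a stand-in "blocks map" (chunk index -> length). */
    static final class Result {
        final String md5Hex;
        final Map<Integer, Integer> blocksMap;

        Result(String md5Hex, Map<Integer, Integer> blocksMap) {
            this.md5Hex = md5Hex;
            this.blocksMap = blocksMap;
        }
    }

    /** Stand-in for deserializing one INode; here it only measures the chunk. */
    static int buildRecord(byte[] chunk) {
        return chunk.length;
    }

    static Result load(byte[][] chunks) {
        final MessageDigest md5;
        try {
            md5 = MessageDigest.getInstance("MD5");
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }

        // One thread for the checksum, six for construction, one for the map.
        ExecutorService digestThread = Executors.newSingleThreadExecutor();
        ExecutorService builders = Executors.newFixedThreadPool(6);
        ExecutorService blocksMapThread = Executors.newSingleThreadExecutor();

        // Plain HashMap: it is only ever written from blocksMapThread.
        Map<Integer, Integer> blocksMap = new HashMap<>();
        List<CompletableFuture<Void>> pending = new ArrayList<>();
        try {
            for (int i = 0; i < chunks.length; i++) {
                final int index = i;
                final byte[] chunk = chunks[i];
                // The single digest thread consumes chunks in submission order.
                pending.add(CompletableFuture.runAsync(() -> md5.update(chunk), digestThread));
                // Build on the pool, then hand the result to the map thread.
                pending.add(CompletableFuture
                        .supplyAsync(() -> buildRecord(chunk), builders)
                        .thenAcceptAsync(size -> blocksMap.put(index, size), blocksMapThread));
            }
            // Wait for every digest update and every map update to finish.
            CompletableFuture.allOf(pending.toArray(new CompletableFuture<?>[0])).join();
        } finally {
            digestThread.shutdown();
            builders.shutdown();
            blocksMapThread.shutdown();
        }
        String md5Hex = String.format("%032x", new BigInteger(1, md5.digest()));
        return new Result(md5Hex, blocksMap);
    }
}
```

Calling `ParallelLoadSketch.load(chunks)` with the image split into byte-array chunks returns the checksum and the populated map once all three stages have drained. Confining the map to one thread avoids locking on the shared structure, which is presumably the motivation for the dedicated blocks-map thread in the prototype.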